[PoC] Improve dead tuple storage for lazy vacuum
Hi all,
Index vacuuming is one of the most time-consuming processes in lazy
vacuuming, and lazy_tid_reaped() accounts for a large part of it. The
attached flame graph shows the profile of a vacuum on a table that has
one index, 80 million live rows, and 20 million dead rows, where
lazy_tid_reaped() accounts for about 47% of the total vacuum execution
time.
lazy_tid_reaped() is essentially an existence check; for every index
tuple, it checks whether the heap TID it points to exists in the set
of dead tuple TIDs. The maximum amount of memory for dead tuple TIDs
is limited by maintenance_work_mem, and if the upper limit is reached,
the heap scan is suspended, index vacuum and heap vacuum are
performed, and then the heap scan is resumed. Therefore, in terms of
index vacuuming performance, there are two important factors: the
performance of looking up TIDs in the set of dead tuples, and its
memory usage. The former is obvious, whereas the latter affects the
number of index vacuuming passes. In many index AMs, index vacuuming
(i.e., ambulkdelete) performs a full scan of the index, so it is
important for performance to avoid executing index vacuuming more than
once during lazy vacuum.
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
1. It cannot allocate more than 1GB. There was a discussion about
eliminating this limitation by using MemoryContextAllocHuge(), but
there were concerns about point 2 [1].
2. The whole memory space is allocated at once.
3. Lookup performance is slow (O(log N)).
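For reference, here is a minimal sketch of the array-plus-bsearch()
scheme described above (an illustration only, not the actual
vacuumlazy.c code; the comparator mirrors what vac_cmp_itemptr() does):

#include "postgres.h"
#include "storage/itemptr.h"

/* Simplified picture of the current dead tuple storage. */
typedef struct DeadTuplesArray
{
	int			num_tuples;		/* current number of dead TIDs */
	int			max_tuples;		/* allocated slots, capped at 1GB of memory */
	ItemPointerData itemptrs[FLEXIBLE_ARRAY_MEMBER];	/* sorted in heap order */
} DeadTuplesArray;

/* Comparator over (block, offset), as vac_cmp_itemptr() does today. */
static int
dead_tuple_cmp(const void *left, const void *right)
{
	BlockNumber lblk = ItemPointerGetBlockNumber((ItemPointer) left);
	BlockNumber rblk = ItemPointerGetBlockNumber((ItemPointer) right);
	OffsetNumber loff;
	OffsetNumber roff;

	if (lblk != rblk)
		return (lblk < rblk) ? -1 : 1;

	loff = ItemPointerGetOffsetNumber((ItemPointer) left);
	roff = ItemPointerGetOffsetNumber((ItemPointer) right);

	if (loff != roff)
		return (loff < roff) ? -1 : 1;
	return 0;
}

/* The existence check done once per index tuple: O(log N) per call. */
static bool
dead_tuple_exists(DeadTuplesArray *dt, ItemPointer itemptr)
{
	return bsearch(itemptr, dt->itemptrs, dt->num_tuples,
				   sizeof(ItemPointerData), dead_tuple_cmp) != NULL;
}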
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Possible Solutions
===============
Firstly, I've considered using existing data structures:
IntegerSet (src/backend/lib/integerset.c) and
TIDBitmap (src/backend/nodes/tidbitmap.c). Both address point 1, but
each addresses only one of points 2 and 3. IntegerSet uses less memory
thanks to simple-8b encoding but is slow at lookup, still O(log N),
since it's a tree structure. On the other hand, TIDBitmap has good
lookup performance, O(1), but can use unnecessarily large amounts of
memory in some cases since it always allocates enough bitmap space to
store all possible offsets. With 8kB blocks, the maximum number of
line pointers in a heap page is 291 (cf. MaxHeapTuplesPerPage), so the
bitmap is 40 bytes long and we always need 46 bytes in total per
block, including other meta information.
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum, borrowing the idea from Roaring Bitmap[2]. The
authors provide an implementation of Roaring Bitmap[3] (Apache 2.0
license), but I've implemented this idea from scratch because we need
to integrate it with Dynamic Shared Memory/Area to support parallel
vacuum, and we need to support ItemPointerData, a 6-byte integer,
whereas that implementation supports only 4-byte integers. Also, when
it comes to vacuum, we need neither intersection, union, nor
difference between sets; we need only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of a
hash table and a container area; the hash table has one entry per
block, and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. There are three
container types: array, bitmap, and run.
For example, if there are two dead tuples at offsets 1 and 150, it
uses the array container, which has an array of two 2-byte integers
representing 1 and 150, using 4 bytes in total. If we used the bitmap
container in this case, we would need 20 bytes instead. On the other
hand, if there are 20 consecutive dead tuples from offset 1 to 20, it
uses the run container, which has an array of pairs of 2-byte
integers: the first value of each pair represents a starting offset
number, and the second value represents the run length. Therefore, in
this case, the run container uses only 4 bytes in total. Finally, if
there are dead tuples at every other offset from 1 to 100, it uses the
bitmap container, which has an uncompressed bitmap, using 13 bytes. We
also need another 16 bytes per block for the hash table entry.
The lookup complexity of a bitmap container is O(1), whereas that of
an array or a run container is O(log N) or O(N). But since the number
of elements in those two containers should not be large, that should
not be a problem.
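To make the layout concrete, here is a minimal sketch of the idea; the
names (RTbmBlockEntry and so on) are hypothetical and this is not the
PoC code itself, just an illustration of the hash-table-plus-container-area
design described above:

#include "postgres.h"
#include "storage/itemptr.h"

typedef enum RTbmContainerType
{
	RTBM_CONTAINER_ARRAY,		/* sorted array of 2-byte offset numbers */
	RTBM_CONTAINER_BITMAP,		/* uncompressed bitmap of offsets */
	RTBM_CONTAINER_RUN			/* pairs of (start offset, run length) */
} RTbmContainerType;

/* Hash table entry: one per heap block that has at least one dead tuple. */
typedef struct RTbmBlockEntry
{
	BlockNumber blkno;			/* hash key */
	RTbmContainerType type;		/* which representation is used */
	uint32		offset;			/* location of the container in the container area */
	uint16		len;			/* number of elements, bytes, or runs */
} RTbmBlockEntry;

/* Existence check within one container; O(1), O(N), or O(log N) by type. */
static bool
rtbm_container_lookup(RTbmContainerType type, const char *container,
					  uint16 len, OffsetNumber off)
{
	const uint16 *vals = (const uint16 *) container;
	const unsigned char *bits = (const unsigned char *) container;
	uint16		i;

	switch (type)
	{
		case RTBM_CONTAINER_BITMAP:
			/* one bit per offset number; offsets are 1-based */
			return (bits[(off - 1) / 8] & (1 << ((off - 1) % 8))) != 0;

		case RTBM_CONTAINER_ARRAY:
			/* small sorted array; a linear scan is shown, bsearch also works */
			for (i = 0; i < len; i++)
				if (vals[i] == off)
					return true;
			return false;

		case RTBM_CONTAINER_RUN:
			/* 'len' runs of (start, length) pairs */
			for (i = 0; i < len; i++)
				if (off >= vals[2 * i] && off < vals[2 * i] + vals[2 * i + 1])
					return true;
			return false;
	}
	return false;				/* unreachable, keeps the compiler quiet */
}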
Evaluation
========
Before implementing this idea in the lazy vacuum code, I've
implemented a benchmark tool dedicated to evaluating lazy_tid_reaped()
performance[4]. It has functions for generating TIDs for both index
tuples and dead tuples, loading dead tuples into the data structure,
and simulating lazy_tid_reaped() using those virtual index tuples and
dead tuples. The code lacks many features such as iteration and
DSM/DSA support, but it makes testing the data structures easier.
FYI, I've confirmed the validity of this tool. When I ran a vacuum on
a 3GB table, index vacuuming took 12.3 sec and lazy_tid_reaped() took
approximately 8.5 sec. Simulating a similar situation with the tool,
the lookup benchmark with the array data structure took approximately
8.0 sec. Given that the tool doesn't simulate the cost of function
calls, it seems to simulate the real behavior reasonably well.
I've evaluated the lookup performance and memory footprint of four
data structures: array, integerset (intset), tidbitmap (tbm), and
roaring tidbitmap (rtbm), while changing the distribution of dead
tuples within blocks. Since tbm doesn't have a function for existence
checks, I added one, and I allocated enough memory to make sure that
tbm never becomes lossy during the evaluation. In all test cases, I
simulated a table with 1,000,000 blocks where every block has at least
one dead tuple. The benchmark scenario is that for each virtual heap
tuple we check whether its TID is in the dead tuple storage. Here are
the results, with execution time in milliseconds and memory usage in
bytes:
* Test-case 1 (10 dead tuples at 20-offset intervals)
An array container is selected in this test case, using 20 bytes for each block.
         Execution Time    Memory Usage
array         14,140.91      60,008,248
intset         9,350.08      50,339,840
tbm            1,299.62     100,671,544
rtbm           1,892.52      58,744,944
* Test-case 2 (10 consecutive dead tuples from offset 1)
A bitmap container is selected in this test case, using 2 bytes for each block.
         Execution Time    Memory Usage
array          1,056.60      60,008,248
intset           650.85      50,339,840
tbm              194.61     100,671,544
rtbm             154.57      27,287,664
* Test-case 3 (2 dead tuples at 1 and 100 offsets)
An array container is selected in this test case, using 4 bytes for
each block. Since the 'array' data structure (not the array container
of rtbm) uses only 12 bytes per block, and rtbm additionally needs a
hash table entry per block, the 'array' data structure uses less
memory here.
         Execution Time    Memory Usage
array          6,054.22      12,008,248
intset         4,203.41      16,785,408
tbm              759.17     100,671,544
rtbm             750.08      29,384,816
* Test-case 4 (100 consecutive dead tuples from offset 1)
A run container is selected in this test case, using 4 bytes for each block.
         Execution Time    Memory Usage
array          8,883.03     600,008,248
intset         7,358.23     100,671,488
tbm              758.81     100,671,544
rtbm             764.33      29,384,816
Overall, 'rtbm' has much better lookup performance and good memory
usage, especially when there are relatively many dead tuples. However,
in some cases 'intset' and 'array' use less memory.
Feedback is very welcome. Thank you for reading the email through to the end.
Regards,
[1]: /messages/by-id/CAGTBQpbDCaR6vv9=scXzuT8fSbckf=a3NgZdWFWZbdVugVht6Q@mail.gmail.com
[2]: http://roaringbitmap.org/
[3]: https://github.com/RoaringBitmap/CRoaring
[4]: https://github.com/MasahikoSawada/pgtools/tree/master/bdbench
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Wed, 7 Jul 2021 at 13:47, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Hi all,
Index vacuuming is one of the most time-consuming processes in lazy
vacuuming. lazy_tid_reaped() is a large part among them. The attached
flame graph shows a profile of a vacuum on a table that has one index
and 80 million live rows and 20 million dead rows, where
lazy_tid_reaped() accounts for about 47% of the total vacuum execution
time.[...]
Overall, 'rtbm' has a much better lookup performance and good memory
usage especially if there are relatively many dead tuples. However, in
some cases, 'intset' and 'array' have a better memory usage.
Those are some great results, with a good path to meaningful improvements.
Feedback is very welcome. Thank you for reading the email through to the end.
The currently available infrastructure for TIDs is quite ill-defined
for TableAM authors [0], and other TableAMs might want to use more
than just the 11 bits needed by heapam's MaxHeapTuplesPerPage at the
maximum BLCKSZ to identify tuples. (MaxHeapTuplesPerPage is 1169 at
the maximum 32kB BLCKSZ, which requires 11 bits to fit.)
Could you also check what the (performance, memory) impact would be if
these proposed structures were to support the maximum
MaxHeapTuplesPerPage of 1169 or the full uint16-range of offset
numbers that could be supported by our current TID struct?
Kind regards,
Matthias van de Meent
[0]: /messages/by-id/0bbeb784050503036344e1f08513f13b2083244b.camel@j-davis.com
On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
I think that the main problem with the 1GB limitation is that it is
surprising -- it can cause disruption when we first exceed the magical
limit of ~174 million TIDs. This can cause us to dirty index pages a
second time when we might have been able to just do it once with
sufficient memory for TIDs. OTOH there are actually cases where having
less memory for TIDs makes performance *better* because of locality
effects. This perverse behavior with memory sizing isn't a rare case
that we can safely ignore -- unfortunately it's fairly common.
My point is that we should be careful to choose the correct goal.
Obviously memory use matters. But it might be more helpful to think of
memory use as just a proxy for what truly matters, not a goal in
itself. It's hard to know what this means (what is the "real goal"?),
and hard to measure it even if you know for sure. It could still be
useful to think of it like this.
A run container is selected in this test case, using 4 bytes for each block.
Execution Time Memory Usage
array 8,883.03 600,008,248
intset 7,358.23 100,671,488
tbm 758.81 100,671,544
rtbm 764.33 29,384,816
Overall, 'rtbm' has a much better
usage especially if there are relatively many dead tuples. However, in
some cases, 'intset' and 'array' have a better memory usage.
This seems very promising.
I wonder how much you have thought about the index AM side. It makes
sense to initially evaluate these techniques using this approach of
separating the data structure from how it is used by VACUUM -- I think
that that was a good idea. But at the same time there may be certain
important theoretical questions that cannot be answered this way --
questions about how everything "fits together" in a real VACUUM might
matter a lot. You've probably thought about this at least a little
already. Curious to hear how you think it "fits together" with the
work that you've done already.
The loop inside btvacuumpage() makes each loop iteration call the
callback -- this is always a call to lazy_tid_reaped() in practice.
And that's where we do binary searches. These binary searches are
usually where we see a huge number of cycles spent when we look at
profiles, including the profile that produced your flame graph. But I
worry that that might be a bit misleading -- the way that profilers
attribute costs is very complicated and can never be fully trusted.
While it is true that lazy_tid_reaped() often accesses main memory,
which will of course add a huge amount of latency and make it a huge
bottleneck, the "big picture" is still relevant.
I think that the compiler currently has to make very conservative
assumptions when generating the machine code used by the loop inside
btvacuumpage(), which calls through an opaque function pointer at
least once per loop iteration -- anything can alias, so the compiler
must be conservative. The data dependencies are hard for both the
compiler and the CPU to analyze. The cost of using a function pointer
compared to a direct function call is usually quite low, but there are
important exceptions -- cases where it prevents other useful
optimizations. Maybe this is an exception.
I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.
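To make that concrete, here is a rough sketch of what the two-loop
structure could look like; IndexBulkDeleteBatchCallback and the batch
callback itself are hypothetical names (today's code uses the per-tuple
IndexBulkDeleteCallback), and for simplicity this ignores posting-list
tuples from deduplication:

#include "postgres.h"
#include "access/itup.h"
#include "storage/bufpage.h"

/* Hypothetical batch interface: classify all heap TIDs of one leaf page at once. */
typedef void (*IndexBulkDeleteBatchCallback) (ItemPointer htids, int nhtids,
											  bool *dead, void *state);

/*
 * Sketch of the two loops: loop 1 collects the heap TIDs of every index
 * tuple on the page, one batch call classifies them, and loop 2 collects
 * the offsets to physically delete (which would then be handed to
 * _bt_delitems_vacuum() as today).
 */
static int
collect_deletable(Page page, OffsetNumber minoff,
				  IndexBulkDeleteBatchCallback batch_callback, void *state,
				  OffsetNumber *deletable)
{
	ItemPointerData htids[MaxIndexTuplesPerPage];
	OffsetNumber offnums[MaxIndexTuplesPerPage];
	bool		dead[MaxIndexTuplesPerPage];
	OffsetNumber offnum;
	OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
	int			nhtids = 0;
	int			ndeletable = 0;
	int			i;

	/* Loop 1: gather the heap TID of each index tuple on the leaf page. */
	for (offnum = minoff; offnum <= maxoff; offnum = OffsetNumberNext(offnum))
	{
		IndexTuple	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));

		offnums[nhtids] = offnum;
		htids[nhtids] = itup->t_tid;
		nhtids++;
	}

	/* One call into vacuumlazy.c per leaf page instead of one per index tuple. */
	batch_callback(htids, nhtids, dead, state);

	/* Loop 2: remember which offsets need to be physically deleted. */
	for (i = 0; i < nhtids; i++)
	{
		if (dead[i])
			deletable[ndeletable++] = offnums[i];
	}

	return ndeletable;
}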
This approach would make btbulkdelete() similar to
_bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
independent of your ideas -- I imagine that this would work
far better when combined with a more compact data structure, which is
naturally more capable of batch processing than a simple array of
TIDs. Maybe this will help the compiler and the CPU to fully
understand the *natural* data dependencies, so that they can be as
effective as possible in making the code run fast. It's possible that
a modern CPU will be able to *hide* the latency more intelligently
than what we have today. The latency is such a big problem that we may
be able to justify "wasting" other CPU resources, just because it
sometimes helps with hiding the latency. For example, it might
actually be okay to sort all of the TIDs on the page to make the bulk
processing work -- though you might still do a precheck that is
similar to the precheck inside lazy_tid_reaped() that was added by you
in commit bbaf315309e.
Of course it's very easy to be wrong about stuff like this. But it
might not be that hard to prototype. You can literally copy and paste
code from _bt_delitems_delete_check() to do this. It does the same
basic thing already.
--
Peter Geoghegan
On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.
Maybe for something like rtbm.c (which is inspired by Roaring
bitmaps), you would really want to use an "intersection" operation for
this. The TIDs that we need to physically delete from the leaf page
inside btvacuumpage() are the intersection of two bitmaps: our bitmap
of all TIDs on the leaf page, and our bitmap of all TIDs that need to
be deleted by the ongoing btbulkdelete() call.
Obviously the typical case is that most TIDs in the index do *not* get
deleted -- needing to delete more than ~20% of all TIDs in the index
will be rare. Ideally it would be very cheap to figure out that a TID
does not need to be deleted at all. Something a little like a negative
cache (but not a true negative cache). This is a little bit like how
hash joins can be made faster by adding a Bloom filter -- most hash
probes don't need to join a tuple in the real world, and we can make
these hash probes even faster by using a Bloom filter as a negative
cache.
If you had the list of TIDs from a leaf page sorted for batch
processing, and if you had roaring bitmap style "chunks" with
"container" metadata stored in the data structure, you could then use
merging/intersection -- that has some of the same advantages. I think
that this would be a lot more efficient than having one binary search
per TID. Most TIDs from the leaf page can be skipped over very
quickly, in large groups. It's very rare for VACUUM to need to delete
TIDs from completely random heap table blocks in the real world (some
kind of pattern is much more common).
When this merging process finds 1 TID that might really be deletable
then it's probably going to find much more than 1 -- better to make
that cache miss take care of all of the TIDs together. Also seems like
the CPU could do some clever prefetching with this approach -- it
could prefetch TIDs where the initial chunk metadata is insufficient
to eliminate them early -- these are the groups of TIDs that will have
many TIDs that we actually need to delete. ISTM that improving
temporal locality through batching could matter a lot here.
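A minimal sketch of that kind of block-at-a-time merging, assuming the
leaf page's TIDs have been sorted in heap order and the dead-tuple
store can hand back a per-block container; rtbm_get_block() and
rtbm_container_contains() are hypothetical names:

#include "postgres.h"
#include "storage/itemptr.h"

/* Hypothetical accessors into the dead-tuple store (names are made up). */
extern bool rtbm_get_block(void *rtbm, BlockNumber blkno, void **container);
extern bool rtbm_container_contains(void *container, OffsetNumber off);

/*
 * Walk the page's sorted heap TIDs block group by block group.  Blocks with
 * no dead tuples at all -- the common case -- are skipped with a single
 * lookup, and per-offset checks only run for blocks that have a container.
 */
static void
mark_dead_tids(void *rtbm, ItemPointer sorted_htids, int nhtids, bool *dead)
{
	int			i = 0;

	while (i < nhtids)
	{
		BlockNumber blkno = ItemPointerGetBlockNumber(&sorted_htids[i]);
		void	   *container;
		bool		has_dead = rtbm_get_block(rtbm, blkno, &container);

		/* process all TIDs pointing at this heap block as one group */
		for (; i < nhtids &&
			 ItemPointerGetBlockNumber(&sorted_htids[i]) == blkno; i++)
		{
			if (!has_dead)
				dead[i] = false;	/* cheap skip, no per-offset work */
			else
				dead[i] = rtbm_container_contains(container,
								ItemPointerGetOffsetNumber(&sorted_htids[i]));
		}
	}
}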
--
Peter Geoghegan
On Wed, Jul 7, 2021 at 11:25 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
On Wed, 7 Jul 2021 at 13:47, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Hi all,
Index vacuuming is one of the most time-consuming processes in lazy
vacuuming. lazy_tid_reaped() is a large part among them. The attached
flame graph shows a profile of a vacuum on a table that has one index
and 80 million live rows and 20 million dead rows, where
lazy_tid_reaped() accounts for about 47% of the total vacuum execution
time.[...]
Overall, 'rtbm' has a much better lookup performance and good memory
usage especially if there are relatively many dead tuples. However, in
some cases, 'intset' and 'array' have a better memory usage.
Those are some great results, with a good path to meaningful improvements.
Feedback is very welcome. Thank you for reading the email through to the end.
The current available infrastructure for TIDs is quite ill-defined for
TableAM authors [0], and other TableAMs might want to use more than
just the 11 bits in use by max-BLCKSZ HeapAM MaxHeapTuplesPerPage to
identify tuples. (MaxHeapTuplesPerPage is 1169 at the maximum 32k
BLCKSZ, which requires 11 bits to fit).
Could you also check what the (performance, memory) impact would be if
these proposed structures were to support the maximum
MaxHeapTuplesPerPage of 1169 or the full uint16-range of offset
numbers that could be supported by our current TID struct?
I think tbm will be the most affected by the memory impact of the
larger maximum MaxHeapTuplesPerPage. For example, with 32kB blocks
(MaxHeapTuplesPerPage = 1169), even if there is only one dead tuple in
a block, it will always require at least 147 bytes per block.
Rtbm chooses the container type among array, bitmap, or run depending
on the number and distribution of dead tuples in a block, and only
bitmap containers can be searched with O(1). Run containers depend on
the distribution of dead tuples within a block. So let’s compare array
and bitmap containers.
With 8kB blocks (MaxHeapTuplesPerPage = 291), 36 bytes are needed for
a bitmap container at maximum. In other words, when compared to an
array container, bitmap will be chosen if there are more than 18 dead
tuples in a block. On the other hand, with 32kB blocks
(MaxHeapTuplesPerPage = 1169), 147 bytes are needed for a bitmap
container at maximum, so bitmap container will be chosen if there are
more than 74 dead tuples in a block. And, with full uint16-range
(MaxHeapTuplesPerPage = 65535), 8192 bytes are needed at maximum, so
bitmap container will be chosen if there are more than 4096 dead
tuples in a block. Therefore, in any case, if more than about 6% of
the tuples in a block are garbage, a bitmap container will be chosen,
bringing faster lookup performance. (Of course, if a run container is
chosen, the container size gets smaller but the lookup performance is
O(log N).) But if the number of dead tuples in the table is small and
we have a larger MaxHeapTuplesPerPage, it's likely that an array
container is chosen and the lookup performance becomes O(log N).
Still, it should be faster than the array data structure because the
range of search targets within an array container is much smaller.
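In other words, the choice between an array and a bitmap container
boils down to a size comparison; a small sketch of that arithmetic
(the thresholds above come from the bitmap container sizes quoted in
this mail, and the exact rounding in the PoC may differ slightly):

/*
 * An array container costs 2 bytes per dead tuple; a bitmap container costs
 * a fixed number of bytes per block, roughly MaxHeapTuplesPerPage / 8.
 * Using the figures above: 36 bytes with 8kB blocks (bitmap wins above ~18
 * dead tuples), 147 bytes with 32kB blocks (above ~74), and 8192 bytes with
 * the full uint16 offset range (above ~4096).
 */
static inline bool
use_bitmap_container(int ndead, int bitmap_container_bytes)
{
	int			array_container_bytes = ndead * 2;	/* sizeof(OffsetNumber) */

	return array_container_bytes > bitmap_container_bytes;
}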
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Jul 8, 2021 at 5:24 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jul 7, 2021 at 4:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
I think that the main problem with the 1GB limitation is that it is
surprising -- it can cause disruption when we first exceed the magical
limit of ~174 million TIDs. This can cause us to dirty index pages a
second time when we might have been able to just do it once with
sufficient memory for TIDs. OTOH there are actually cases where having
less memory for TIDs makes performance *better* because of locality
effects. This perverse behavior with memory sizing isn't a rare case
that we can safely ignore -- unfortunately it's fairly common.
My point is that we should be careful to choose the correct goal.
Obviously memory use matters. But it might be more helpful to think of
memory use as just a proxy for what truly matters, not a goal in
itself. It's hard to know what this means (what is the "real goal"?),
and hard to measure it even if you know for sure. It could still be
useful to think of it like this.
As I wrote in the first email, I think there are two important factors
in index vacuuming performance: the performance of checking whether
the heap TID that an index tuple points to is dead, and the number of
times index bulk-deletion is performed. The flame graph I attached in
the first mail shows that the CPU spent much time in
lazy_tid_reaped(), but vacuum is a disk-intensive operation in
practice. Given that most index AMs' bulk-deletion does a full index
scan and a table can have multiple indexes, reducing the number of
index bulk-deletion passes really contributes to reducing the
execution time, especially for large tables. I think that a more
compact data structure for dead tuple TIDs is one of the ways to
achieve that.
A run container is selected in this test case, using 4 bytes for each block.
Execution Time Memory Usage
array 8,883.03 600,008,248
intset 7,358.23 100,671,488
tbm 758.81 100,671,544
rtbm 764.33 29,384,816
Overall, 'rtbm' has a much better lookup performance and good memory
usage especially if there are relatively many dead tuples. However, in
some cases, 'intset' and 'array' have a better memory usage.
This seems very promising.
I wonder how much you have thought about the index AM side. It makes
sense to initially evaluate these techniques using this approach of
separating the data structure from how it is used by VACUUM -- I think
that that was a good idea. But at the same time there may be certain
important theoretical questions that cannot be answered this way --
questions about how everything "fits together" in a real VACUUM might
matter a lot. You've probably thought about this at least a little
already. Curious to hear how you think it "fits together" with the
work that you've done already.
Yeah, that definitely needs to be considered. Currently, what we need
from the dead tuple storage for lazy vacuum is store, lookup, and
iteration. And given parallel vacuum, it has to be allocatable in DSM
or DSA. While implementing the PoC code, I'm trying to integrate it
with the current lazy vacuum code. As far as I've seen so far, the
integration is not hard, at least with the *current* lazy vacuum code
and index AM code.
The loop inside btvacuumpage() makes each loop iteration call the
callback -- this is always a call to lazy_tid_reaped() in practice.
And that's where we do binary searches. These binary searches are
usually where we see a huge number of cycles spent when we look at
profiles, including the profile that produced your flame graph. But I
worry that that might be a bit misleading -- the way that profilers
attribute costs is very complicated and can never be fully trusted.
While it is true that lazy_tid_reaped() often accesses main memory,
which will of course add a huge amount of latency and make it a huge
bottleneck, the "big picture" is still relevant.I think that the compiler currently has to make very conservative
assumptions when generating the machine code used by the loop inside
btvacuumpage(), which calls through an opaque function pointer at
least once per loop iteration -- anything can alias, so the compiler
must be conservative. The data dependencies are hard for both the
compiler and the CPU to analyze. The cost of using a function pointer
compared to a direct function call is usually quite low, but there are
important exceptions -- cases where it prevents other useful
optimizations. Maybe this is an exception.
I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.
This approach would make btbulkdelete() similar to
_bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
an independent idea to your ideas -- I imagine that this would work
far better when combined with a more compact data structure, which is
naturally more capable of batch processing than a simple array of
TIDs. Maybe this will help the compiler and the CPU to fully
understand the *natural* data dependencies, so that they can be as
effective as possible in making the code run fast. It's possible that
a modern CPU will be able to *hide* the latency more intelligently
than what we have today. The latency is such a big problem that we may
be able to justify "wasting" other CPU resources, just because it
sometimes helps with hiding the latency. For example, it might
actually be okay to sort all of the TIDs on the page to make the bulk
processing work -- though you might still do a precheck that is
similar to the precheck inside lazy_tid_reaped() that was added by you
in commit bbaf315309e.
Interesting idea. I remember you mentioned this idea somewhere, and
I considered it too while implementing the PoC code. It's definitely
worth trying. Maybe we can work on this as a separate patch? It will
change the index AM interface and could also improve the current
bulk-deletion. We can consider a better data structure on top of this
idea.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Very nice results.
I have been working on the same problem but with a slightly different
solution - a mix of binary search for (sub)pages and 32-bit bitmaps
for tid-in-page.
Even with the current allocation heuristics (allocate 291 tids per
page) it initially allocates much less space: instead of the current
291*6=1746 bytes per page it needs to allocate only 80 bytes.
Also it can be laid out so that it is friendly to parallel SIMD
searches doing up to 8 tid lookups in parallel.
That said, for allocating the tid array, the best solution is to
postpone it as much as possible and to do the initial collection into
a file, which
1) postpones the memory allocation to the beginning of index cleanups
2) lets you select the correct size and structure as you know more
about the distribution at that time
3) do the first heap pass in one go and then advance frozenxmin
*before* index cleanup
Also, collecting dead tids into a file makes it trivial (well, almost
:) ) to parallelize the initial heap scan, so more resources can be
thrown at it if available.
Cheers
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
Resending as forgot to send to the list (thanks Peter :) )
On Wed, Jul 7, 2021 at 10:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
The loop inside btvacuumpage() makes each loop iteration call the
callback -- this is always a call to lazy_tid_reaped() in practice.
And that's where we do binary searches. These binary searches are
usually where we see a huge number of cycles spent when we look at
profiles, including the profile that produced your flame graph. But I
worry that that might be a bit misleading -- the way that profilers
attribute costs is very complicated and can never be fully trusted.
While it is true that lazy_tid_reaped() often accesses main memory,
which will of course add a huge amount of latency and make it a huge
bottleneck, the "big picture" is still relevant.
This is why I have mainly focused on making it possible to use SIMD and
run 4-8 binary searches in parallel, mostly 8, for AVX2.
How I am approaching this is separating "page search" to run over a
(naturally) sorted array of 32-bit page pointers, and only when the
page is found are the indexes in this array used to look up the
in-page bitmaps.
This allows the heavier bsearch activity to run on a smaller range of
memory, hopefully reducing the cache thrashing.
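For illustration, a minimal sketch of that two-level lookup (binary
search over a dense sorted array of block numbers, then a bit test in
the per-page bitmap); the layout below is only an illustration, not
the actual 80-byte page_bsearch_vector structure:

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Illustration only: the binary search touches only the compact array of
 * block numbers (better cache behaviour, and amenable to running several
 * searches in parallel with SIMD); the per-page bitmap is read only after
 * the page has been found.
 */
typedef struct PageDeadBitmap
{
	int			npages;			/* number of heap pages with dead tuples */
	BlockNumber *blknos;		/* sorted block numbers, 4 bytes each */
	uint32	   (*bitmaps)[10];	/* per-page bitmaps: 10 x 32 bits >= 291 offsets */
} PageDeadBitmap;

static bool
page_bitmap_lookup(const PageDeadBitmap *pdb, ItemPointer tid)
{
	BlockNumber blkno = ItemPointerGetBlockNumber(tid);
	OffsetNumber off = ItemPointerGetOffsetNumber(tid);
	int			lo = 0;
	int			hi = pdb->npages - 1;

	while (lo <= hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (pdb->blknos[mid] == blkno)
			return (pdb->bitmaps[mid][(off - 1) / 32] &
					((uint32) 1 << ((off - 1) % 32))) != 0;
		if (pdb->blknos[mid] < blkno)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return false;				/* block has no dead tuples at all */
}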
There are opportunities to optimise this further for cache hits, by
collecting the tids from indexes in larger batches and then
constraining the searches in the main is-deleted bitmap to run over
sections of it, but at some point this becomes a very complex
balancing act, as the manipulation of the bits-to-check from indexes
also takes time, not to mention the need to release the index pages
and then later chase the tid pointers in case they have moved while
checking them.
I have not measured anything yet, but one of my concerns is that, in
the case of very large dead tuple collections, an 8-way parallel
bsearch could actually get close to saturating RAM bandwidth by
reading (8 x 32 bits x cache-line-size) bytes from main memory every
few cycles, so we may need some inner-loop level throttling similar to
the current vacuum_cost_limit for data pages.
I think that the compiler currently has to make very conservative
assumptions when generating the machine code used by the loop inside
btvacuumpage(), which calls through an opaque function pointer at
least once per loop iteration -- anything can alias, so the compiler
must be conservative.
Definitely this! The lookup function needs to be turned into an inline
function or #define as well to give the compiler maximum freedom.
The data dependencies are hard for both the
compiler and the CPU to analyze. The cost of using a function pointer
compared to a direct function call is usually quite low, but there are
important exceptions -- cases where it prevents other useful
optimizations. Maybe this is an exception.
Yes. Also this could be a place where unrolling the loop could make a
real difference.
Maybe not unrolling the full 32 loops for a 32-bit bsearch, but
something like an 8-loop unroll for getting most of the benefit.
The 32x unroll would not really be that bad for performance if all 32
loops were needed, but mostly we would need to jump into the last 10
to 20 iterations when looking up between 1,000 and 1,000,000 pages,
and I suspect this is such a weird corner case that the compiler is
really unlikely to support this optimisation. Of course I may be wrong
and it is a common enough case for the optimiser.
I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.
While it may make sense to have different bitmap encodings for
different distributions, it likely would not be good for optimisations
if all these are used at the same time.
This is why I propose that the first bitmap-collecting phase collect
into a file and then - when reading it into memory for the lookup
phase - possibly rewrite the initial structure to something else if it
sees that it is more efficient. Like for example where the first half
of the file consists of only empty pages.
This approach would make btbulkdelete() similar to
_bt_simpledel_pass() + _bt_delitems_delete_check(). This is not really
an independent idea to your ideas -- I imagine that this would work
far better when combined with a more compact data structure, which is
naturally more capable of batch processing than a simple array of
TIDs. Maybe this will help the compiler and the CPU to fully
understand the *natural* data dependencies, so that they can be as
effective as possible in making the code run fast. It's possible that
a modern CPU will be able to *hide* the latency more intelligently
than what we have today. The latency is such a big problem that we may
be able to justify "wasting" other CPU resources, just because it
sometimes helps with hiding the latency. For example, it might
actually be okay to sort all of the TIDs on the page to make the bulk
processing work
Then again it may be so much extra work that it starts to dominate
some parts of profiles.
For example see the work that was done in improving the mini-vacuum
part where it was actually faster to copy data out to a separate
buffer and then back in than shuffle it around inside the same 8k page
:)
So only testing will tell.
-- though you might still do a precheck that is
similar to the precheck inside lazy_tid_reaped() that was added by you
in commit bbaf315309e.
Of course it's very easy to be wrong about stuff like this. But it
might not be that hard to prototype. You can literally copy and paste
code from _bt_delitems_delete_check() to do this. It does the same
basic thing already.
Also a lot of testing would be needed to figure out which strategy
fits best for which distribution of dead tuples, and possibly their
relation to the order of tuples to check from indexes.
Cheers
--
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
On Thu, Jul 8, 2021 at 1:47 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
As I wrote in the first email, I think there are two important factors
in index vacuuming performance: the performance to check if heap TID
that an index tuple points to is dead, and the number of times to
perform index bulk-deletion. The flame graph I attached in the first
mail shows CPU spent much time on lazy_tid_reaped() but vacuum is a
disk-intensive operation in practice.
Maybe. But I recently bought an NVME SSD that can read at over
6GB/second. So "disk-intensive" is not what it used to be -- at least
not for reads. In general it's not good if we do multiple scans of an
index -- no question. But there is a danger in paying a little too
much attention to what is true in general -- we should not ignore what
might be true in specific cases either. Maybe we can solve some
problems by spilling the TID data structure to disk -- if we trade
sequential I/O for random I/O, we may be able to do only one pass over
the index (especially when we have *almost* enough memory to fit all
TIDs, but not quite enough).
The big problem with multiple passes over the index is not the extra
read bandwidth -- it's the extra page dirtying (writes), especially
with things like indexes on UUID columns. We want to dirty each leaf
page in each index at most once per VACUUM, and should be willing to
pay some cost in order to get a larger benefit with page dirtying.
After all, writes are much more expensive on modern flash devices --
if we have to do more random read I/O to spill the TIDs then that
might actually be 100% worth it. And, we don't need much memory for
something that works well as a negative cache, either -- so maybe the
extra random read I/O needed to spill the TIDs will be very limited
anyway.
There are many possibilities. You can probably think of other
trade-offs yourself. We could maybe use a cost model for all this --
it is a little like a hash join IMV. This is just something to think
about while refining the design.
Interesting idea. I remember you mentioned this idea somewhere and
I've considered this idea too while implementing the PoC code. It's
definitely worth trying. Maybe we can write a patch for this as a
separate patch? It will change index AM and could improve also the
current bulk-deletion. We can consider a better data structure on top
of this idea.
I'm happy to write it as a separate patch, either by leaving it to you
or by collaborating directly. It's not necessary to tie it to the
first patch. But at the same time it is highly related to what you're
already doing.
As I said I am totally prepared to be wrong here. But it seems worth
it to try. In Postgres 14, the _bt_delitems_vacuum() function (which
actually carries out VACUUM's physical page modifications to a leaf
page) is almost identical to _bt_delitems_delete(). And
_bt_delitems_delete() was already built with these kinds of problems
in mind -- it batches work to get the most out of synchronizing with
distant state describing which tuples to delete. It's not exactly the
same situation, but it's *kinda* similar. More importantly, it's a
relatively cheap and easy experiment to run, since we already have
most of what we need (we can take it from
_bt_delitems_delete_check()).
Usually this kind of micro optimization is not very valuable -- 99.9%+
of all code just isn't that sensitive to having the right
optimizations. But this is one of the rare important cases where we
really should look at the raw machine code, and do some kind of
microarchitectural level analysis through careful profiling, using
tools like perf. The laws of physics (or electronic engineering) make
it inevitable that searching for TIDs to match is going to be kind of
slow. But we should at least make sure that we use every trick
available to us to reduce the bottleneck, since it really does matter
a lot to users. Users should be able to expect that this code will at
least be as fast as the hardware that they paid for can allow (or
close to it). There is a great deal of microarchitectural
sophistication with modern CPUs, much of which is designed to make
problems like this one less bad [1].
[1]: https://www.agner.org/optimize/microarchitecture.pdf
--
Peter Geoghegan
On Thu, Jul 8, 2021 at 1:53 PM Hannu Krosing <hannuk@google.com> wrote:
How I am approaching this is separating "page search" to run over a
(naturally) sorted array of 32 bit page pointers and only when the
page is found the indexes in this array are used to look up the
in-page bitmaps.
This allows the heavier bsearch activity to run on smaller range of
memory, hopefully reducing the cache thrashing.
I think that the really important thing is to figure out roughly the
right data structure first.
There are opportunities to optimise this further for cache hits, by
collecting the tids from indexes in larger batches and then
constraining the searches in the main is-deleted-bitmap to run over
sections of it, but at some point this becomes a very complex
balancing act, as the manipulation of the bits-to-check from indexes
also takes time, not to mention the need to release the index pages
and then later chase the tid pointers in case they have moved while
checking them.
I would say that 200 TIDs per leaf page is common and ~1350 TIDs per
leaf page is not uncommon (with deduplication). Seems like that might
be enough?
I have not measured anything yet, but one of my concerns in case of
very large dead tuple collections searched by 8-way parallel bsearch
could actually get close to saturating RAM bandwidth by reading (8 x
32bits x cache-line-size) bytes from main memory every few cycles, so
we may need some inner-loop level throttling similar to current
vacuum_cost_limit for data pages.
If it happens then it'll be a nice problem to have, I suppose.
Maybe not unrolling the full 32 loops for 32-bit bsearch, but
something like 8-loop unroll for getting most of the benefit.
My current assumption is that we're bound by memory speed right now,
and that that is the big bottleneck to eliminate -- we must keep the
CPU busy with data to process first. That seems like the most
promising thing to focus on right now.
While it may make sense to have different bitmap encodings for
different distributions, it likely would not be good for optimisations
if all these are used at the same time.
To some degree designs like Roaring bitmaps are just that -- a way of
dynamically figuring out which strategy to use based on data
characteristics.
This is why I propose the first bitmap collecting phase to collect
into a file and then - when reading into memory for lookups phase -
possibly rewrite the initial structure to something else if it sees
that it is more efficient. Like for example where the first half of
the file consists of only empty pages.
Yeah, I agree that something like that could make sense. Although
rewriting it doesn't seem particularly promising, since we can easily
make it cheap to process any TID that falls into a range of blocks
that have no dead tuples. We don't need to rewrite the data structure
to make it do that well, AFAICT.
When I said that I thought of this a little like a hash join, I was
being more serious than you might imagine. Note that the number of
index tuples that VACUUM will delete from each index can now be far
less than the total number of TIDs stored in memory. So even when we
have (say) 20% of all of the TIDs from the table in our in memory list
managed by vacuumlazy.c, it's now quite possible that VACUUM will only
actually "match"/"join" (i.e. delete) as few as 2% of the index tuples
it finds in the index (there really is no way to predict how many).
The opportunistic deletion stuff could easily be doing most of the
required cleanup in an eager fashion following recent improvements --
VACUUM need only take care of "floating garbage" these days. In other
words, thinking about this as something that is a little bit like a
hash join makes sense because hash joins do very well with high join
selectivity, and high join selectivity is common in the real world.
The intersection of TIDs from each leaf page with the in-memory TID
delete structure will often be very small indeed.
Then again it may be so much extra work that it starts to dominate
some parts of profiles.
For example see the work that was done in improving the mini-vacuum
part where it was actually faster to copy data out to a separate
buffer and then back in than shuffle it around inside the same 8k page
Some of what I'm saying is based on the experience of improving
similar code used by index tuple deletion in Postgres 14. That did
quite a lot of sorting of TIDs and things like that. In the end the
sorting had no more than a negligible impact on performance. What
really mattered was that we efficiently coordinate with distant heap
pages that describe which index tuples we can delete from a given leaf
page. Sorting hundreds of TIDs is cheap. Reading hundreds of random
locations in memory (or even far fewer) is not so cheap. It might even
be very slow indeed. Sorting in order to batch could end up looking
like cheap insurance that we should be glad to pay for.
So only testing will tell.
True.
--
Peter Geoghegan
On Fri, Jul 9, 2021 at 12:34 AM Peter Geoghegan <pg@bowt.ie> wrote:
...
I would say that 200 TIDs per leaf page is common and ~1350 TIDs per
leaf page is not uncommon (with deduplication). Seems like that might
be enough?
Likely yes, and also it would have the nice property of not changing
the index page locking behaviour.
Are deduplicated tids in the leaf page already sorted in heap order?
This could potentially simplify / speed up the sort.
I have not measured anything yet, but one of my concerns in case of
very large dead tuple collections searched by 8-way parallel bsearch
could actually get close to saturating RAM bandwidth by reading (8 x
32bits x cache-line-size) bytes from main memory every few cycles, so
we may need some inner-loop level throttling similar to current
vacuum_cost_limit for data pages.
If it happens then it'll be a nice problem to have, I suppose.
Maybe not unrolling the full 32 loops for 32-bit bsearch, but
something like 8-loop unroll for getting most of the benefit.
My current assumption is that we're bound by memory speed right now,
Most likely yes, and this should also be easy to check by manually
unrolling perhaps 4 loops and measuring any speed increase.
and that that is the big bottleneck to eliminate -- we must keep the
CPU busy with data to process first. That seems like the most
promising thing to focus on right now.
This actually has two parts:
- trying to make sure that we can serve as much as possible from cache
- if we need to go out of cache, then trying to parallelise this as
much as possible
At the same time we need to watch that we are not making the index
tuple preparation work so heavy that it starts to dominate over memory
access.
While it may make sense to have different bitmap encodings for
different distributions, it likely would not be good for optimisations
if all these are used at the same time.
To some degree designs like Roaring bitmaps are just that -- a way of
dynamically figuring out which strategy to use based on data
characteristics.
it is, but as I am keeping one eye open for vectorisation, I don't
like when different parts of the same bitmap have radically different
encoding strategies.
This is why I propose the first bitmap collecting phase to collect
into a file and then - when reading into memory for lookups phase -
possibly rewrite the initial structure to something else if it sees
that it is more efficient. Like for example where the first half of
the file consists of only empty pages.
Yeah, I agree that something like that could make sense. Although
rewriting it doesn't seem particularly promising,
yeah, I hope to prove (or verify :) ) the structure is good enough so
that it does not need the rewrite.
since we can easily
make it cheap to process any TID that falls into a range of blocks
that have no dead tuples.
I actually meant the opposite case, where we could replace a full
80-byte, 291-bit "all dead" bitmap with just a range - an int4 for the
page and two int2s for min and max tid-in-page - for an extra 10x
reduction, on top of the original 21x reduction from the current
6-bytes-per-tid encoding to my page_bsearch_vector bitmaps, which
encode one page in at most 80 bytes (5 x int4 sub-page pointers + 5 x
int4 bitmaps).
I also started out by investigating RoaringBitmaps, but when I
realized that we would likely have to rewrite it anyway, I continued
working on getting to a single uniform encoding which fits most use
cases Good Enough, and then using that uniformity to let the compiler
do its optimisation and hopefully also vectorization magic.
We don't need to rewrite the data structure
to make it do that well, AFAICT.
When I said that I thought of this a little like a hash join, I was
being more serious than you might imagine. Note that the number of
index tuples that VACUUM will delete from each index can now be far
less than the total number of TIDs stored in memory. So even when we
have (say) 20% of all of the TIDs from the table in our in memory list
managed by vacuumlazy.c, it's now quite possible that VACUUM will only
actually "match"/"join" (i.e. delete) as few as 2% of the index tuples
it finds in the index (there really is no way to predict how many).
The opportunistic deletion stuff could easily be doing most of the
required cleanup in an eager fashion following recent improvements --
VACUUM need only take care of "floating garbage" these days.
Ok, this points to the need to mainly optimise for a quite sparse
population of dead tuples, which is still mainly clustered page-wise?
In other
words, thinking about this as something that is a little bit like a
hash join makes sense because hash joins do very well with high join
selectivity, and high join selectivity is common in the real world.
The intersection of TIDs from each leaf page with the in-memory TID
delete structure will often be very small indeed.
The hard-to-optimize case is still when we have dead tuple counts in
the hundreds of millions, or even billions, like on an HTAP database
where a few hours of OLAP queries have accumulated loads of dead tuples
in tables getting heavy OLTP traffic.
There of course we could do a totally different optimisation, where we
also allow reaping tuples newer than the OLAP query's snapshot if we
can prove that when the snapshot moves forward next time, it has to
jump over said transactions, making them indeed DEAD and not RECENTLY
DEAD. Currently we let a single OLAP query ruin everything :)
Then again it may be so much extra work that it starts to dominate
some parts of profiles.
For example see the work that was done in improving the mini-vacuum
part where it was actually faster to copy data out to a separate
buffer and then back in than to shuffle it around inside the same 8k page.
Some of what I'm saying is based on the experience of improving
similar code used by index tuple deletion in Postgres 14. That did
quite a lot of sorting of TIDs and things like that. In the end the
sorting had no more than a negligible impact on performance.
Good to know :)
What
really mattered was that we efficiently coordinate with distant heap
pages that describe which index tuples we can delete from a given leaf
page. Sorting hundreds of TIDs is cheap. Reading hundreds of random
locations in memory (or even far fewer) is not so cheap. It might even
be very slow indeed. Sorting in order to batch could end up looking
like cheap insurance that we should be glad to pay for.
If the most expensive operation is sorting a few hundred tids, then
this should be fast enough.
My worries were more that after the sorting we cannot do simple
index lookups for them, but each needs to be found via bsearch (or maybe
even a plain linear search if that is faster under some size limit), and
that these could add up. Or some other needed thing that also has to be
done, like allocating extra memory or moving other data around in a
way that the CPU does not like.
Cheers
-----
Hannu Krosing
Google Cloud - We have a long list of planned contributions and we are hiring.
Contact me if interested.
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
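For illustration, a minimal sketch of that kind of encoding (mirroring
what the backend's itemptr_encode() does; the comparator name below is
made up), which turns the TID comparison into a single int64 compare:

/* Hypothetical sketch: pack a TID into an int64 so the bsearch
 * comparator becomes one integer comparison instead of comparing
 * block and offset separately. */
static inline int64
tid_encode(ItemPointer tid)
{
    return ((int64) ItemPointerGetBlockNumber(tid) << 16) |
        ItemPointerGetOffsetNumber(tid);
}

static int
vac_cmp_itemptr64(const void *left, const void *right)
{
    int64   l = tid_encode((ItemPointer) left);
    int64   r = tid_encode((ItemPointer) right);

    return (l > r) - (l < r);
}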
The big problem with bsearch isn't imo the O(log(n)) complexity - it's
that it has abominably bad cache locality. And that can be addressed:
https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf
Imo 2) isn't really that hard a problem to improve, even if we were to
stay with the current bsearch approach. Reallocation with an aggressive
growth factor or such isn't that bad.
That's not to say we ought to stay with binary search...
Problems Solutions
===============
Firstly, I've considered using existing data structures:
IntegerSet(src/backend/lib/integerset.c) and
TIDBitmap(src/backend/nodes/tidbitmap.c). Those address point 1 but
only either point 2 or 3. IntegerSet uses lower memory thanks to
simple-8b encoding but is slow at lookup, still O(logN), since it’s a
tree structure. On the other hand, TIDBitmap has a good lookup
performance, O(1), but could unnecessarily use larger memory in some
cases since it always allocates the space for bitmap enough to store
all possible offsets. With 8kB blocks, the maximum number of line
pointers in a heap page is 291 (c.f., MaxHeapTuplesPerPage) so the
bitmap is 40 bytes long and we always need 46 bytes in total per block
including other meta information.
Imo tidbitmap isn't particularly good, even in the current use cases -
it's constraining in what we can store (a problem for other AMs), not
actually that dense, the lossy mode doesn't choose what information to
lose well, etc.
It'd be nice if we came up with a datastructure that could also replace
the bitmap scan cases.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
Not a huge fan of encoding this much knowledge about the tid layout...
For example, if there are two dead tuples at offset 1 and 150, it uses
the array container that has an array of two 2-byte integers
representing 1 and 150, using 4 bytes in total. If we used the bitmap
container in this case, we would need 20 bytes instead. On the other
hand, if there are consecutive 20 dead tuples from offset 1 to 20, it
uses the run container that has an array of 2-byte integers. The first
value in each pair represents a starting offset number, whereas the
second value represents its length. Therefore, in this case, the run
container uses only 4 bytes in total. Finally, if there are dead
tuples at every other offset from 1 to 100, it uses the bitmap
container that has an uncompressed bitmap, using 13 bytes. We need
another 16 bytes per block entry for the hash table entry.
The lookup complexity of a bitmap container is O(1) whereas the one of
an array and a run container is O(N) or O(logN), but since the number of
elements in those two containers should not be large, it would not be a
problem.
Hm. Why is O(N) not an issue? Consider e.g. the case of a table in which
many tuples have been deleted. In cases where the "run" storage is
cheaper (e.g. because there's high offset numbers due to HOT pruning),
we could end up regularly scanning a few hundred entries for a
match. That's not cheap anymore.
Evaluation
========
Before implementing this idea and integrating it with lazy vacuum
code, I've implemented a benchmark tool dedicated to evaluating
lazy_tid_reaped() performance[4].
Good idea!
In all test cases, I simulated that the table has 1,000,000 blocks and
every block has at least one dead tuple.
That doesn't strike me as a particularly common scenario? I think it's
quite rare for dead tuples to be spread so evenly but sparsely. In
particular it's very common for there to be long runs of dead tuples
separated by long ranges of no dead tuples at all...
The benchmark scenario is that for
each virtual heap tuple we check if there is its TID in the dead
tuple storage. Here are the results of execution time in milliseconds
and memory usage in bytes:
In which order are the dead tuples checked? Looks like in sequential
order? In the case of an index over a column that's not correlated with
the heap order the lookups are often much more random - which can
influence lookup performance drastically, due to differences in
cache locality. Which will make some structures look worse/better than
others.
Greetings,
Andres Freund
Hi,
On 2021-07-08 20:53:32 -0700, Andres Freund wrote:
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
Just using itemptr_encode() makes array in test #1 go from 8s to 6.5s on my
machine.
Another thing I just noticed is that you didn't include the build times for the
datastructures. They are lower than the lookups currently, but it does seem
like a relevant thing to measure as well. E.g. for #1 I see the following build
times
array 24.943 ms
tbm 206.456 ms
intset 93.575 ms
vtbm 134.315 ms
rtbm 145.964 ms
that's a significant range...
Randomizing the lookup order (using a random shuffle in
generate_index_tuples()) changes the benchmark results for #1 significantly:
shuffled time unshuffled time
array 6551.726 ms 6478.554 ms
intset 67590.879 ms 10815.810 ms
rtbm 17992.487 ms 2518.492 ms
tbm 364.917 ms 360.128 ms
vtbm 12227.884 ms 1288.123 ms
FWIW, I get an assertion failure when using an assertion build:
#2 0x0000561800ea02e0 in ExceptionalCondition (conditionName=0x7f9115a88e91 "found", errorType=0x7f9115a88d11 "FailedAssertion",
fileName=0x7f9115a88e8a "rtbm.c", lineNumber=242) at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
#3 0x00007f9115a87645 in rtbm_add_tuples (rtbm=0x561806293280, blkno=0, offnums=0x7fffdccabb00, nitems=10) at rtbm.c:242
#4 0x00007f9115a8363d in load_rtbm (rtbm=0x561806293280, itemptrs=0x7f908a203050, nitems=10000000) at bdbench.c:618
#5 0x00007f9115a834b9 in rtbm_attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:587
#6 0x00007f9115a83837 in attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:658
#7 0x00007f9115a84190 in attach_dead_tuples (fcinfo=0x56180322d690) at bdbench.c:873
I assume you just inverted the Assert(found) assertion?
Greetings,
Andres Freund
On Fri, Jul 9, 2021 at 12:53 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
The big problem with bsearch isn't imo the O(log(n)) complexity - it's
that it has abominably bad cache locality. And that can be addressed:
https://arxiv.org/ftp/arxiv/papers/1509/1509.05053.pdf
Imo 2) isn't really that hard a problem to improve, even if we were to
stay with the current bsearch approach. Reallocation with an aggressive
growth factor or such isn't that bad.
That's not to say we ought to stay with binary search...
Problems Solutions
===============
Firstly, I've considered using existing data structures:
IntegerSet(src/backend/lib/integerset.c) and
TIDBitmap(src/backend/nodes/tidbitmap.c). Those address point 1 but
only either point 2 or 3. IntegerSet uses lower memory thanks to
simple-8b encoding but is slow at lookup, still O(logN), since it’s a
tree structure. On the other hand, TIDBitmap has a good lookup
performance, O(1), but could unnecessarily use larger memory in some
cases since it always allocates the space for bitmap enough to store
all possible offsets. With 8kB blocks, the maximum number of line
pointers in a heap page is 291 (c.f., MaxHeapTuplesPerPage) so the
bitmap is 40 bytes long and we always need 46 bytes in total per block
including other meta information.
Imo tidbitmap isn't particularly good, even in the current use cases -
it's constraining in what we can store (a problem for other AMs), not
actually that dense, the lossy mode doesn't choose what information to
lose well, etc.
It'd be nice if we came up with a datastructure that could also replace
the bitmap scan cases.
Agreed.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
Not a huge fan of encoding this much knowledge about the tid layout...
For example, if there are two dead tuples at offset 1 and 150, it uses
the array container that has an array of two 2-byte integers
representing 1 and 150, using 4 bytes in total. If we used the bitmap
container in this case, we would need 20 bytes instead. On the other
hand, if there are consecutive 20 dead tuples from offset 1 to 20, it
uses the run container that has an array of 2-byte integers. The first
value in each pair represents a starting offset number, whereas the
second value represents its length. Therefore, in this case, the run
container uses only 4 bytes in total. Finally, if there are dead
tuples at every other offset from 1 to 100, it uses the bitmap
container that has an uncompressed bitmap, using 13 bytes. We need
another 16 bytes per block entry for the hash table entry.
The lookup complexity of a bitmap container is O(1) whereas the one of
an array and a run container is O(N) or O(logN), but since the number of
elements in those two containers should not be large, it would not be a
problem.
Hm. Why is O(N) not an issue? Consider e.g. the case of a table in which
many tuples have been deleted. In cases where the "run" storage is
cheaper (e.g. because there's high offset numbers due to HOT pruning),
we could end up regularly scanning a few hundred entries for a
match. That's not cheap anymore.
With 8kB blocks, the maximum size of a bitmap container is 37 bytes.
IOW, the other two types of containers are always smaller than 37 bytes.
Since the run container uses 4 bytes per run, the number of runs in a
run container is never more than 9. Even with 32kB blocks, we don't
have more than 37 runs. So I think N is small enough in this case.
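As a rough illustration of that size argument (the selection function
and names below are mine, not from the patch): with
MaxHeapTuplesPerPage = 291 the bitmap container is capped at
(291 + 7) / 8 = 37 bytes, so a per-block loader can simply pick
whichever representation is smallest:

/* Hypothetical sketch of choosing the smallest container type. */
typedef enum { CONTAINER_ARRAY, CONTAINER_RUN, CONTAINER_BITMAP } ContainerType;

static ContainerType
choose_container(int noffsets, int nruns)
{
    int     bitmap_bytes = (MaxHeapTuplesPerPage + 7) / 8; /* 37 with 8kB blocks */
    int     array_bytes = noffsets * sizeof(uint16);       /* 2 bytes per offset */
    int     run_bytes = nruns * 2 * sizeof(uint16);        /* start + length per run */

    if (array_bytes <= run_bytes && array_bytes <= bitmap_bytes)
        return CONTAINER_ARRAY;
    if (run_bytes <= bitmap_bytes)
        return CONTAINER_RUN;
    return CONTAINER_BITMAP;
}

Since a run container is only chosen while run_bytes stays at or below
37, it can hold at most 9 runs with 8kB blocks, which is why the linear
scan stays cheap.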
Evaluation
========
Before implementing this idea and integrating it with lazy vacuum
code, I've implemented a benchmark tool dedicated to evaluating
lazy_tid_reaped() performance[4].
Good idea!
In all test cases, I simulated that the table has 1,000,000 blocks and
every block has at least one dead tuple.
That doesn't strike me as a particularly common scenario? I think it's
quite rare for dead tuples to be spread so evenly but sparsely. In
particular it's very common for there to be long runs of dead tuples
separated by long ranges of no dead tuples at all...
Agreed. I'll test with such scenarios.
The benchmark scenario is that for
each virtual heap tuple we check if there is its TID in the dead
tuple storage. Here are the results of execution time in milliseconds
and memory usage in bytes:
In which order are the dead tuples checked? Looks like in sequential
order? In the case of an index over a column that's not correlated with
the heap order the lookups are often much more random - which can
influence lookup performance drastically, due to differences in
cache locality. Which will make some structures look worse/better than
others.
Good point. It's sequential order, which is not good. I'll test again
after shuffling virtual index tuples.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Fri, Jul 9, 2021 at 2:37 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-08 20:53:32 -0700, Andres Freund wrote:
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
1. Don't allocate more than 1GB. There was a discussion to eliminate
this limitation by using MemoryContextAllocHuge() but there were
concerns about point 2[1].
2. Allocate the whole memory space at once.
3. Slow lookup performance (O(logN)).
I’ve done some experiments in this area and would like to share the
results and discuss ideas.
Yea, this is a serious issue.
3) could possibly be addressed to a decent degree without changing the
fundamental datastructure too much. There's some sizable and trivial
wins by just changing vac_cmp_itemptr() to compare int64s and by using
an open coded bsearch().
Just using itemptr_encode() makes array in test #1 go from 8s to 6.5s on my
machine.
Another thing I just noticed is that you didn't include the build times for the
datastructures. They are lower than the lookups currently, but it does seem
like a relevant thing to measure as well. E.g. for #1 I see the following build
times
array 24.943 ms
tbm 206.456 ms
intset 93.575 ms
vtbm 134.315 ms
rtbm 145.964 ms
that's a significant range...
Good point. I got similar results when measuring on my machine:
array 57.987 ms
tbm 297.720 ms
intset 113.796 ms
vtbm 165.268 ms
rtbm 199.658 ms
Randomizing the lookup order (using a random shuffle in
generate_index_tuples()) changes the benchmark results for #1 significantly:
shuffled time unshuffled time
array 6551.726 ms 6478.554 ms
intset 67590.879 ms 10815.810 ms
rtbm 17992.487 ms 2518.492 ms
tbm 364.917 ms 360.128 ms
vtbm 12227.884 ms 1288.123 ms
I believe that in your test, tbm_reaped() actually always returned
true. That could explain why tbm was very fast in both cases. Since
TIDBitmap in core doesn't support the existence check, tbm_reaped()
in bdbench.c always returns true. I added a patch in the repository to
add existence check support to TIDBitmap, although it assumes the bitmap
is never lossy.
That being said, I'm surprised that rtbm is slower than array even in
the unshuffled case. I've also measured the shuffle cases and got
different results. To be clear, I used prepare() SQL function to
prepare both virtual dead tuples and index tuples, load them by
attach_dead_tuples() SQL function, and executed bench() SQL function
for each data structure. Here are the results:
shuffled time unshuffled time
array 88899.513 ms 12616.521 ms
intset 73476.055 ms 10063.405 ms
rtbm 22264.671 ms 2073.171 ms
tbm 10285.092 ms 1417.312 ms
vtbm 14488.581 ms 1240.666 ms
FWIW, I get an assertion failure when using an assertion build:
#2 0x0000561800ea02e0 in ExceptionalCondition (conditionName=0x7f9115a88e91 "found", errorType=0x7f9115a88d11 "FailedAssertion",
fileName=0x7f9115a88e8a "rtbm.c", lineNumber=242) at /home/andres/src/postgresql/src/backend/utils/error/assert.c:69
#3 0x00007f9115a87645 in rtbm_add_tuples (rtbm=0x561806293280, blkno=0, offnums=0x7fffdccabb00, nitems=10) at rtbm.c:242
#4 0x00007f9115a8363d in load_rtbm (rtbm=0x561806293280, itemptrs=0x7f908a203050, nitems=10000000) at bdbench.c:618
#5 0x00007f9115a834b9 in rtbm_attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:587
#6 0x00007f9115a83837 in attach (lvtt=0x7f9115a8c300 <LVTestSubjects+352>, nitems=10000000, minblk=2139062143, maxblk=2139062143, maxoff=32639)
at bdbench.c:658
#7 0x00007f9115a84190 in attach_dead_tuples (fcinfo=0x56180322d690) at bdbench.c:873I assume you just inverted the Assert(found) assertion?
Right. Fixed it.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Jul 8, 2021 at 7:51 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Wed, Jul 7, 2021 at 1:24 PM Peter Geoghegan <pg@bowt.ie> wrote:
I wonder how much it would help to break up that loop into two loops.
Make the callback into a batch operation that generates state that
describes what to do with each and every index tuple on the leaf page.
The first loop would build a list of TIDs, then you'd call into
vacuumlazy.c and get it to process the TIDs, and finally the second
loop would physically delete the TIDs that need to be deleted. This
would mean that there would be only one call per leaf page per
btbulkdelete(). This would reduce the number of calls to the callback
by at least 100x, and maybe more than 1000x.
Maybe for something like rtbm.c (which is inspired by Roaring
bitmaps), you would really want to use an "intersection" operation for
this. The TIDs that we need to physically delete from the leaf page
inside btvacuumpage() are the intersection of two bitmaps: our bitmap
of all TIDs on the leaf page, and our bitmap of all TIDs that need to
be deleted by the ongoing btbulkdelete() call.
Agreed. In such a batch operation, what we need to do here is to
compute the intersection of two bitmaps.
Obviously the typical case is that most TIDs in the index do *not* get
deleted -- needing to delete more than ~20% of all TIDs in the index
will be rare. Ideally it would be very cheap to figure out that a TID
does not need to be deleted at all. Something a little like a negative
cache (but not a true negative cache). This is a little bit like how
hash joins can be made faster by adding a Bloom filter -- most hash
probes don't need to join a tuple in the real world, and we can make
these hash probes even faster by using a Bloom filter as a negative
cache.
Agreed.
If you had the list of TIDs from a leaf page sorted for batch
processing, and if you had roaring bitmap style "chunks" with
"container" metadata stored in the data structure, you could then use
merging/intersection -- that has some of the same advantages. I think
that this would be a lot more efficient than having one binary search
per TID. Most TIDs from the leaf page can be skipped over very
quickly, in large groups. It's very rare for VACUUM to need to delete
TIDs from completely random heap table blocks in the real world (some
kind of pattern is much more common).
When this merging process finds 1 TID that might really be deletable
then it's probably going to find much more than 1 -- better to make
that cache miss take care of all of the TIDs together. Also seems like
the CPU could do some clever prefetching with this approach -- it
could prefetch TIDs where the initial chunk metadata is insufficient
to eliminate them early -- these are the groups of TIDs that will have
many TIDs that we actually need to delete. ISTM that improving
temporal locality through batching could matter a lot here.
That's a promising approach.
In rtbm, the pair of one hash entry and one container is used per
block. Therefore, if a block has no dead tuples, we can skip its TIDs
from the leaf page just by checking the hash table. If the hash entry
exists, meaning the block has at least one dead tuple, we look up the
TID's offset from the leaf page in that block's container.
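A minimal sketch of that lookup path (the entry type and the
container_contains() helper are hypothetical):

/* Hypothetical sketch: a miss in the per-block hash table lets us skip
 * every TID on that heap block; a hit narrows the check to that block's
 * small container. */
static bool
rtbm_style_reaped(HTAB *block_table, ItemPointer tid)
{
    BlockNumber blkno = ItemPointerGetBlockNumber(tid);
    OffsetNumber off = ItemPointerGetOffsetNumber(tid);
    BlockEntry *entry;

    entry = hash_search(block_table, &blkno, HASH_FIND, NULL);
    if (entry == NULL)
        return false;       /* no dead tuples at all on this block */

    return container_contains(entry, off); /* array/run/bitmap specific */
}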
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Thu, Jul 8, 2021 at 10:40 PM Hannu Krosing <hannuk@google.com> wrote:
Very nice results.
I have been working on the same problem but a bit different solution -
a mix of binary search for (sub)pages and 32-bit bitmaps for
tid-in-page.
Even with current allocation heuristics (allocate 291 tids per page)
it initially allocates much less space; instead of the current 291*6=1746
bytes per page it needs to allocate 80 bytes.
Also it can be laid out so that it is friendly to parallel SIMD
searches doing up to 8 tid lookups in parallel.
Interesting.
That said, for allocating the tid array, the best solution is to
postpone it as much as possible and to do the initial collection into
a file, which
1) postpones the memory allocation to the beginning of index cleanups
2) lets you select the correct size and structure as you know more
about the distribution at that time
3) lets you do the first heap pass in one go and then advance frozenxmin
*before* index cleanup
I think we have to do index vacuuming before heap vacuuming (2nd heap
pass). So do you mean that it advances relfrozenxid of pg_class before
both index vacuuming and heap vacuuming?
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
The authors provide an implementation of Roaring Bitmap[3] (Apache
2.0 license). But I've implemented this idea from scratch because we
need to integrate it with Dynamic Shared Memory/Area to support
parallel vacuum and need to support ItemPointerData, 6-bytes integer
in total, whereas the implementation supports only 4-bytes integers.
Also, when it comes to vacuum, we neither need to compute the
intersection, the union, nor the difference between sets, but need
only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?
Greetings,
Andres Freund
Hi,
On 2021-07-09 10:17:49 -0700, Andres Freund wrote:
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
The authors provide an implementation of Roaring Bitmap[3] (Apache
2.0 license). But I've implemented this idea from scratch because we
need to integrate it with Dynamic Shared Memory/Area to support
parallel vacuum and need to support ItemPointerData, 6-bytes integer
in total, whereas the implementation supports only 4-bytes integers.
Also, when it comes to vacuum, we neither need to compute the
intersection, the union, nor the difference between sets, but need
only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?
I experimented further, trying to use an old radix tree implementation I
had lying around to store dead tuples. With a bit of trickery that seems
to work well.
The radix tree implementation I have basically maps an int64 to another
int64. Each level of the radix tree stores 6 bits of the key, and uses
those 6 bits to index a 1<<6 entry array leading to the next level.
My first idea was to use itemptr_encode() to convert tids into an int64
and store the lower 6 bits in the value part of the radix tree. That
turned out to work well performance-wise, but awful memory-usage-wise.
The problem is that we use at most 9 bits for offsets, but reserve
16 bits for it in the ItemPointerData. Which means that there's often a
lot of empty "tree levels" for those 0 bits, making it hard to get to a
decent memory usage.
The simplest way to address that was to simply compress out those
guaranteed-to-be-zero bits. That results in memory usage that's quite
good - nearly always beating array, occasionally beating rtbm. It's an
ordered datastructure, so the latter isn't too surprising. For lookup
performance the radix approach is commonly among the best, if not the
best.
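For illustration, the compressed key encoding might look roughly like
this (the names are mine; MaxHeapTuplesPerPage fits in 9 bits):

/* Hypothetical sketch: itemptr_encode() reserves 16 bits for the
 * offset, which leaves several radix levels nearly empty; packing the
 * key with only the 9 bits an offset can actually use keeps the tree
 * shallower and denser. */
static inline uint64
tid_to_radix_key(ItemPointer tid)
{
    uint64  block = ItemPointerGetBlockNumber(tid);
    uint64  offset = ItemPointerGetOffsetNumber(tid);   /* <= 291, fits in 9 bits */

    return (block << 9) | offset;
}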
A variation of the storage approach is to just use the block number as
the index, and store the tids as the value. Even with the absolutely
naive approach of just using a Bitmapset that reduces memory usage
substantially - at a small cost to search performance. Of course it'd be
better to use an adaptive approach like you did for rtbm, I just thought
this is good enough.
This largely works well, except when there are a large number of evenly
spread out dead tuples. I don't think that's a particularly common
situation, but it's worth considering anyway.
The reason the memory usage can be larger for sparse workloads is that
they obviously can lead to tree nodes with only one child. As those
nodes are quite large (1<<6 pointers to further children), that can
then lead to a large increase in memory usage.
I have toyed with implementing adaptively large radix nodes like
proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
gotten it quite working.
Greetings,
Andres Freund
On Sat, Jul 10, 2021 at 2:17 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
The authors provide an implementation of Roaring Bitmap[3] (Apache
2.0 license). But I've implemented this idea from scratch because we
need to integrate it with Dynamic Shared Memory/Area to support
parallel vacuum and need to support ItemPointerData, 6-bytes integer
in total, whereas the implementation supports only 4-bytes integers.
Also, when it comes to vacuum, we neither need to compute the
intersection, the union, nor the difference between sets, but need
only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?
Yes, I'm thinking that the iteration of rtbm would be somewhat similar to
tbm. That is, we collect the hash table entries, qsort them by block
number, and then fetch each entry along with its container, one by one,
in block number order.
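A minimal sketch of that iteration order (the entry type and fields are
hypothetical):

/* Hypothetical sketch: sort the collected block-hash entries by block
 * number so the second heap pass can visit heap blocks in order. */
static int
blockentry_cmp(const void *a, const void *b)
{
    BlockNumber ba = (*(BlockEntry *const *) a)->blkno;
    BlockNumber bb = (*(BlockEntry *const *) b)->blkno;

    return (ba > bb) - (ba < bb);
}

static void
rtbm_begin_iterate(BlockEntry **entries, int nentries)
{
    qsort(entries, nentries, sizeof(BlockEntry *), blockentry_cmp);
    /* the caller then walks entries[] and each entry's container in block order */
}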
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Sorry for the late reply.
On Sat, Jul 10, 2021 at 11:55 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-09 10:17:49 -0700, Andres Freund wrote:
On 2021-07-07 20:46:38 +0900, Masahiko Sawada wrote:
Currently, the TIDs of dead tuples are stored in an array that is
collectively allocated at the start of lazy vacuum and TID lookup uses
bsearch(). There are the following challenges and limitations:
So I prototyped a new data structure dedicated to storing dead tuples
during lazy vacuum while borrowing the idea from Roaring Bitmap[2].
The authors provide an implementation of Roaring Bitmap[3] (Apache
2.0 license). But I've implemented this idea from scratch because we
need to integrate it with Dynamic Shared Memory/Area to support
parallel vacuum and need to support ItemPointerData, 6-bytes integer
in total, whereas the implementation supports only 4-bytes integers.
Also, when it comes to vacuum, we neither need to compute the
intersection, the union, nor the difference between sets, but need
only an existence check.
The data structure is somewhat similar to TIDBitmap. It consists of
the hash table and the container area; the hash table has entries per
block and each block entry allocates its memory space, called a
container, in the container area to store its offset numbers. The
container area is actually an array of bytes and can be enlarged as
needed. In the container area, the data representation of offset
numbers varies depending on their cardinality. It has three container
types: array, bitmap, and run.
How are you thinking of implementing iteration efficiently for rtbm? The
second heap pass needs that obviously... I think the only option would
be to qsort the whole thing?
I experimented further, trying to use an old radix tree implementation I
had lying around to store dead tuples. With a bit of trickery that seems
to work well.
Thank you for experimenting with another approach.
The radix tree implementation I have basically maps an int64 to another
int64. Each level of the radix tree stores 6 bits of the key, and uses
those 6 bits to index a 1<<6 entry array leading to the next level.
My first idea was to use itemptr_encode() to convert tids into an int64
and store the lower 6 bits in the value part of the radix tree. That
turned out to work well performance wise, but awfully memory usage
wise. The problem is that we at most use 9 bits for offsets, but reserve
16 bits for it in the ItemPointerData. Which means that there's often a
lot of empty "tree levels" for those 0 bits, making it hard to get to a
decent memory usage.
The simplest way to address that was to simply compress out those
guaranteed-to-be-zero bits. That results in memory usage that's quite
good - nearly always beating array, occasionally beating rtbm. It's an
ordered datastructure, so the latter isn't too surprising. For lookup
performance the radix approach is commonly among the best, if not the
best.
How do its lookup performance and memory usage compare to
intset? I guess the performance trends of those two approaches are
similar since both consist of a tree. Intset encodes uint64 values with
Simple-8b encoding, so I'm also interested in the comparison in terms
of memory usage.
A variation of the storage approach is to just use the block number as
the index, and store the tids as the value. Even with the absolutely
naive approach of just using a Bitmapset that reduces memory usage
substantially - at a small cost to search performance. Of course it'd be
better to use an adaptive approach like you did for rtbm, I just thought
this is good enough.
This largely works well, except when there are a large number of evenly
spread out dead tuples. I don't think that's a particularly common
situation, but it's worth considering anyway.
The reason the memory usage can be larger for sparse workloads is that
they obviously can lead to tree nodes with only one child. As those
nodes are quite large (1<<6 pointers to further children), that can
then lead to a large increase in memory usage.
Interesting. How big was it in such workloads compared to other data
structures?
I personally like adaptive approaches, especially in the context of
vacuum improvements. We know common patterns of dead tuple
distribution, but they don't necessarily hold, since the distribution
depends on the data and on autovacuum timings etc., even with the same
workload. And we might be able to provide a new approach that works
well in 95% of use cases, but if things get worse than before in the
other 5%, I don't think the approach is a good one. Ideally, it
should be better in common cases and at least be the same as before in
other cases.
BTW is the implementation of the radix tree approach available
somewhere? If so I'd like to experiment with that too.
I have toyed with implementing adaptively large radix nodes like
proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
gotten it quite working.
That seems a promising approach.
Regards,
[1]: /messages/by-id/CA+TgmoakKFXwUv1Cx2mspUuPQHzYF74BfJ8koF5YdgVLCvhpwA@mail.gmail.com
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote:
BTW is the implementation of the radix tree approach available
somewhere? If so I'd like to experiment with that too.
I have toyed with implementing adaptively large radix nodes like
proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
gotten it quite working.
That seems a promising approach.
I've since implemented some, but not all of the ideas of that paper
(adaptive node sizes, but not the tree compression pieces).
E.g. for
select prepare(
1000000, -- max block
20, -- # of dead tuples per page
10, -- dead tuples interval within a page
1 -- page interval
);
attach size shuffled ordered
array 69 ms 120 MB 84.87 s 8.66 s
intset 173 ms 65 MB 68.82 s 11.75 s
rtbm 201 ms 67 MB 11.54 s 1.35 s
tbm 232 ms 100 MB 8.33 s 1.26 s
vtbm 162 ms 58 MB 10.01 s 1.22 s
radix 88 ms 42 MB 11.49 s 1.67 s
and for
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1 -- page interval
);
attach size shuffled ordered
array 24 ms 60MB 3.74s 1.02 s
intset 97 ms 49MB 3.14s 0.75 s
rtbm 138 ms 36MB 0.41s 0.14 s
tbm 198 ms 101MB 0.41s 0.14 s
vtbm 118 ms 27MB 0.39s 0.12 s
radix 33 ms 10MB 0.28s 0.10 s
(this is an almost unfairly good case for radix)
Running out of time to format the results of the other testcases before
I have to run, unfortunately. radix uses 42MB both in test case 3 and
4.
The radix tree code isn't good right now. A ridiculous amount of
duplication etc. The naming clearly shows its origins from a buffer
mapping radix tree...
Currently in a bunch of the cases 20% of the time is spent in
radix_reaped(). If I move that into radix.c and allow bfm_lookup() to be
inlined, I get reduced overhead. rtbm for example essentially already
does that, because it does the splitting of the ItemPointer in rtbm.c.
I've attached my current patches against your tree.
Greetings,
Andres Freund
Attachments:
0001-Fix-build-warnings.patch (text/x-diff; charset=us-ascii)
From 5dfbe02000aefd3e085bdea0ec809247e1fb71b3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 19 Jul 2021 16:03:28 -0700
Subject: [PATCH 1/3] Fix build warnings.
---
bdbench/bdbench.c | 2 +-
bdbench/rtbm.c | 4 ++--
bdbench/vtbm.c | 10 +++++++---
3 files changed, 10 insertions(+), 6 deletions(-)
diff --git a/bdbench/bdbench.c b/bdbench/bdbench.c
index 800567d..1df5c53 100644
--- a/bdbench/bdbench.c
+++ b/bdbench/bdbench.c
@@ -655,7 +655,7 @@ _bench(LVTestType *lvtt)
fclose(f);
#endif
- elog(NOTICE, "\"%s\": dead tuples %lu, index tuples %lu, mathed %d, mem %zu",
+ elog(NOTICE, "\"%s\": dead tuples %lu, index tuples %lu, matched %d, mem %zu",
lvtt->name,
lvtt->dtinfo.nitems,
IndexTids_cache->dtinfo.nitems,
diff --git a/bdbench/rtbm.c b/bdbench/rtbm.c
index 025d2a9..eac277a 100644
--- a/bdbench/rtbm.c
+++ b/bdbench/rtbm.c
@@ -449,9 +449,9 @@ dump_entry(RTbm *rtbm, DtEntry *entry)
}
}
- elog(NOTICE, "%s (offset %d len %d)",
+ elog(NOTICE, "%s (offset %llu len %d)",
str.data,
- entry->offset, len);
+ (long long unsigned) entry->offset, len);
}
static int
diff --git a/bdbench/vtbm.c b/bdbench/vtbm.c
index c59d6e1..63320f5 100644
--- a/bdbench/vtbm.c
+++ b/bdbench/vtbm.c
@@ -72,7 +72,8 @@ vtbm_add_tuples(VTbm *vtbm, const BlockNumber blkno,
DtEntry *entry;
bool found;
char oldstatus;
- int wordnum, bitnum;
+ int wordnum = 0;
+ int bitnum;
entry = dttable_insert(vtbm->dttable, blkno, &found);
Assert(!found);
@@ -216,8 +217,10 @@ vtbm_dump(VTbm *vtbm)
vtbm->bitmap_size, vtbm->npages);
for (int i = 0; i < vtbm->npages; i++)
{
+ char *bitmap;
+
entry = entries[i];
- char *bitmap = &(vtbm->bitmap[entry->offset]);
+ bitmap = &(vtbm->bitmap[entry->offset]);
appendStringInfo(&str, "[%5d] : ", entry->blkno);
for (int off = 0; off < entry->len; off++)
@@ -239,6 +242,7 @@ vtbm_dump_blk(VTbm *vtbm, BlockNumber blkno)
{
DtEntry *entry;
StringInfoData str;
+ char *bitmap;
initStringInfo(&str);
@@ -252,7 +256,7 @@ vtbm_dump_blk(VTbm *vtbm, BlockNumber blkno)
return;
}
- char *bitmap = &(vtbm->bitmap[entry->offset]);
+ bitmap = &(vtbm->bitmap[entry->offset]);
appendStringInfo(&str, "[%5d] : ", entry->blkno);
for (int off = 1; off < entry->len; off++)
--
2.32.0.rc2
0002-Add-radix-tree.patch (text/x-diff; charset=us-ascii)
From 5ba05ffad4a9605a6fb5a24fe625542aee226ec8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 19 Jul 2021 16:04:55 -0700
Subject: [PATCH 2/3] Add radix tree.
---
bdbench/radix.c | 3088 +++++++++++++++++++++++++++++++++++++++++++++++
bdbench/radix.h | 76 ++
2 files changed, 3164 insertions(+)
create mode 100644 bdbench/radix.c
create mode 100644 bdbench/radix.h
diff --git a/bdbench/radix.c b/bdbench/radix.c
new file mode 100644
index 0000000..c7061f0
--- /dev/null
+++ b/bdbench/radix.c
@@ -0,0 +1,3088 @@
+/*
+ *
+ */
+
+#include "postgres.h"
+
+#include "radix.h"
+
+#include "lib/stringinfo.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+
+
+/*
+ * How many bits are encoded in one tree level.
+ *
+ * Linux uses 6, ART uses 8. In a non-adaptive radix tree the disadvantage of
+ * a higher fanout is increased memory usage - but the adaptive node size
+ * addresses that to a good degree. Using a common multiple of 8 (i.e. bits
+ * in a byte) has the advantage of making it easier to eventually support
+ * variable length data. Therefore go with 8 for now.
+ */
+#define BFM_FANOUT 8
+
+#define BFM_MAX_CLASS (1<<BFM_FANOUT)
+
+#define BFM_MASK ((1 << BFM_FANOUT) - 1)
+
+
+/*
+ * Base type for all node types.
+ */
+struct bfm_tree_node_inner;
+typedef struct bfm_tree_node
+{
+ /*
+ * Size class of entry (stored as uint8 instead of bfm_tree_node_kind to
+ * save space).
+ *
+ * XXX: For efficiency in random access cases it'd be a good idea to
+ * encode the kind of a node in the pointer value of upper nodes, in the
+ * low bits. Being able to do the node type dispatch during traversal
+ * before the memory for the node has been fetched from memory would
+ * likely improve performance significantly. But that'd require at least
+ * 8 byte alignment, which we don't currently guarantee on all platforms.
+ */
+ uint8 kind;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. I.e. the key is shifted by `shift` and the lowest BFM_FANOUT bits
+ * are then represented in chunk.
+ */
+ uint8 node_shift;
+ uint8 node_chunk;
+
+ /*
+ * Number of children - currently uint16 to be able to indicate 256
+ * children at a fanout of 8.
+ */
+ uint16 count;
+
+ /* FIXME: Right now there's always unused bytes here :( */
+
+ /*
+ * FIXME: could be removed by using a stack while walking down to deleted
+ * node.
+ */
+ struct bfm_tree_node_inner *parent;
+} bfm_tree_node;
+
+/*
+ * Base type for all inner nodes.
+ */
+typedef struct bfm_tree_node_inner
+{
+ bfm_tree_node b;
+} bfm_tree_node_inner;
+
+/*
+ * Base type for all leaf nodes.
+ */
+typedef struct bfm_tree_node_leaf
+{
+ bfm_tree_node b;
+} bfm_tree_node_leaf;
+
+
+/*
+ * Size classes.
+ *
+ * To reduce memory usage compared to a simple radix tree with a fixed fanout
+ * we use adaptive node sizes, with different storage methods for different
+ * numbers of elements.
+ *
+ * FIXME: These are currently not well chosen. To reduce memory fragmentation
+ * smaller class should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now its
+ * 32->56->160->304->1296->2064/2096 bytes for inner/leaf nodes, repeatedly
+ * just above a power of 2, leading to large amounts of allocator padding with
+ * aset.c. Hence the use of slab.
+ *
+ * FIXME: Duplication.
+ *
+ * XXX: Consider implementing path compression, it reduces worst case memory
+ * usage substantially. I.e. collapse sequences of nodes with just one child
+ * into one node. That would make it feasible to use this datastructure for
+ * wide keys. Gut feeling: When compressing inner nodes a limited number of
+ * tree levels should be skippable to keep nodes of a constant size. But when
+ * collapsing to leaf nodes it likely is worth to make them variable width,
+ * it's such a common scenario (a sparse key will always end with such a chain
+ * of nodes).
+ */
+
+/*
+ * Inner node size classes.
+ */
+typedef struct bfm_tree_node_inner_1
+{
+ bfm_tree_node_inner b;
+
+ /* single child, for key chunk */
+ uint8 chunk;
+ bfm_tree_node *slot;
+} bfm_tree_node_inner_1;
+
+typedef struct bfm_tree_node_inner_4
+{
+ bfm_tree_node_inner b;
+
+ /* four children, for key chunks */
+ uint8 chunks[4];
+ bfm_tree_node *slots[4];
+} bfm_tree_node_inner_4;
+
+typedef struct bfm_tree_node_inner_16
+{
+ bfm_tree_node_inner b;
+
+ /* four children, for key chunks */
+ uint8 chunks[16];
+ bfm_tree_node *slots[16];
+} bfm_tree_node_inner_16;
+
+#define BFM_TREE_NODE_32_INVALID 0xFF
+typedef struct bfm_tree_node_inner_32
+{
+ bfm_tree_node_inner b;
+
+ /*
+ * 32 children. Offsets is indexed by the key chunk and points into
+ * ->slots. An offset of BFM_TREE_NODE_32_INVALID indicates a non-existing
+ * entry.
+ *
+ * XXX: It'd be nice to shrink the offsets array to use fewer bits - we
+ * only need to index into an array of 32 entries. But 32 offsets already
+ * is 5 bits, making a simple & fast encoding nontrivial.
+ */
+ uint8 chunks[32];
+ bfm_tree_node *slots[32];
+} bfm_tree_node_inner_32;
+
+#define BFM_TREE_NODE_128_INVALID 0xFF
+typedef struct bfm_tree_node_inner_128
+{
+ bfm_tree_node_inner b;
+
+ uint8 offsets[BFM_MAX_CLASS];
+ bfm_tree_node *slots[128];
+} bfm_tree_node_inner_128;
+
+typedef struct bfm_tree_node_inner_max
+{
+ bfm_tree_node_inner b;
+ bfm_tree_node *slots[BFM_MAX_CLASS];
+} bfm_tree_node_inner_max;
+
+
+/*
+ * Leaf node size classes.
+ *
+ * Currently these are separate from inner node size classes for two main
+ * reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+
+typedef struct bfm_tree_node_leaf_1
+{
+ bfm_tree_node_leaf b;
+ uint8 chunk;
+ bfm_value_type value;
+} bfm_tree_node_leaf_1;
+
+#define BFM_TREE_NODE_LEAF_4_INVALID 0xFFFF
+typedef struct bfm_tree_node_leaf_4
+{
+ bfm_tree_node_leaf b;
+ uint8 chunks[4];
+ bfm_value_type values[4];
+} bfm_tree_node_leaf_4;
+
+#define BFM_TREE_NODE_LEAF_16_INVALID 0xFFFF
+typedef struct bfm_tree_node_leaf_16
+{
+ bfm_tree_node_leaf b;
+ uint8 chunks[16];
+ bfm_value_type values[16];
+} bfm_tree_node_leaf_16;
+
+typedef struct bfm_tree_node_leaf_32
+{
+ bfm_tree_node_leaf b;
+ uint8 chunks[32];
+ bfm_value_type values[32];
+} bfm_tree_node_leaf_32;
+
+typedef struct bfm_tree_node_leaf_128
+{
+ bfm_tree_node_leaf b;
+ uint8 offsets[BFM_MAX_CLASS];
+ bfm_value_type values[128];
+} bfm_tree_node_leaf_128;
+
+typedef struct bfm_tree_node_leaf_max
+{
+ bfm_tree_node_leaf b;
+ uint8 set[BFM_MAX_CLASS / (sizeof(uint8) * BITS_PER_BYTE)];
+ bfm_value_type values[BFM_MAX_CLASS];
+} bfm_tree_node_leaf_max;
+
+
+typedef struct bfm_tree_size_class_info
+{
+ const char *const name;
+ int elements;
+ size_t size;
+} bfm_tree_size_class_info;
+
+const bfm_tree_size_class_info inner_class_info[] =
+{
+ [BFM_KIND_1] = {"1", 1, sizeof(bfm_tree_node_inner_1)},
+ [BFM_KIND_4] = {"4", 4, sizeof(bfm_tree_node_inner_4)},
+ [BFM_KIND_16] = {"16", 16, sizeof(bfm_tree_node_inner_16)},
+ [BFM_KIND_32] = {"32", 32, sizeof(bfm_tree_node_inner_32)},
+ [BFM_KIND_128] = {"128", 128, sizeof(bfm_tree_node_inner_128)},
+ [BFM_KIND_MAX] = {"max", BFM_MAX_CLASS, sizeof(bfm_tree_node_inner_max)},
+};
+
+const bfm_tree_size_class_info leaf_class_info[] =
+{
+ [BFM_KIND_1] = {"1", 1, sizeof(bfm_tree_node_leaf_1)},
+ [BFM_KIND_4] = {"4", 4, sizeof(bfm_tree_node_leaf_4)},
+ [BFM_KIND_16] = {"16", 16, sizeof(bfm_tree_node_leaf_16)},
+ [BFM_KIND_32] = {"32", 32, sizeof(bfm_tree_node_leaf_32)},
+ [BFM_KIND_128] = {"128", 128, sizeof(bfm_tree_node_leaf_128)},
+ [BFM_KIND_MAX] = {"max", BFM_MAX_CLASS, sizeof(bfm_tree_node_leaf_max)},
+};
+
+static void *
+bfm_alloc_node(bfm_tree *root, bool inner, bfm_tree_node_kind kind, size_t size)
+{
+ bfm_tree_node *node;
+
+#ifdef BFM_USE_SLAB
+ if (inner)
+ node = (bfm_tree_node *) MemoryContextAlloc(root->inner_slabs[kind], size);
+ else
+ node = (bfm_tree_node *) MemoryContextAlloc(root->leaf_slabs[kind], size);
+#elif defined(BFM_USE_OS)
+ node = (bfm_tree_node *) malloc(size);
+#else
+ node = (bfm_tree_node *) MemoryContextAlloc(root->context, size);
+#endif
+
+ return node;
+}
+
+static bfm_tree_node_inner *
+bfm_alloc_inner(bfm_tree *root, bfm_tree_node_kind kind, size_t size)
+{
+ bfm_tree_node_inner *node;
+
+ Assert(inner_class_info[kind].size == size);
+#ifdef BFM_STATS
+ root->inner_nodes[kind]++;
+#endif
+
+ node = bfm_alloc_node(root, true, kind, size);
+
+ memset(&node->b, 0, sizeof(node->b));
+ node->b.kind = kind;
+
+ return node;
+}
+
+static bfm_tree_node_inner *
+bfm_alloc_leaf(bfm_tree *root, bfm_tree_node_kind kind, size_t size)
+{
+ bfm_tree_node_inner *node;
+
+ Assert(leaf_class_info[kind].size == size);
+#ifdef BFM_STATS
+ root->leaf_nodes[kind]++;
+#endif
+
+ node = bfm_alloc_node(root, false, kind, size);
+
+ memset(&node->b, 0, sizeof(node->b));
+ node->b.kind = kind;
+
+ return node;
+}
+
+
+static bfm_tree_node_inner_1 *
+bfm_alloc_inner_1(bfm_tree *root)
+{
+ bfm_tree_node_inner_1 *node =
+ (bfm_tree_node_inner_1 *) bfm_alloc_inner(root, BFM_KIND_1, sizeof(*node));
+
+ return node;
+}
+
+#define BFM_TREE_NODE_INNER_4_INVALID 0xFF
+static bfm_tree_node_inner_4 *
+bfm_alloc_inner_4(bfm_tree *root)
+{
+ bfm_tree_node_inner_4 *node =
+ (bfm_tree_node_inner_4 *) bfm_alloc_inner(root, BFM_KIND_4, sizeof(*node));
+
+ return node;
+}
+
+#define BFM_TREE_NODE_INNER_16_INVALID 0xFF
+static bfm_tree_node_inner_16 *
+bfm_alloc_inner_16(bfm_tree *root)
+{
+ bfm_tree_node_inner_16 *node =
+ (bfm_tree_node_inner_16 *) bfm_alloc_inner(root, BFM_KIND_16, sizeof(*node));
+
+ return node;
+}
+
+#define BFM_TREE_NODE_INNER_32_INVALID 0xFF
+static bfm_tree_node_inner_32 *
+bfm_alloc_inner_32(bfm_tree *root)
+{
+ bfm_tree_node_inner_32 *node =
+ (bfm_tree_node_inner_32 *) bfm_alloc_inner(root, BFM_KIND_32, sizeof(*node));
+
+ return node;
+}
+
+static bfm_tree_node_inner_128 *
+bfm_alloc_inner_128(bfm_tree *root)
+{
+ bfm_tree_node_inner_128 *node =
+ (bfm_tree_node_inner_128 *) bfm_alloc_inner(root, BFM_KIND_128, sizeof(*node));
+
+ memset(&node->offsets, BFM_TREE_NODE_128_INVALID, sizeof(node->offsets));
+
+ return node;
+}
+
+static bfm_tree_node_inner_max *
+bfm_alloc_inner_max(bfm_tree *root)
+{
+ bfm_tree_node_inner_max *node =
+ (bfm_tree_node_inner_max *) bfm_alloc_inner(root, BFM_KIND_MAX, sizeof(*node));
+
+ memset(&node->slots, 0, sizeof(node->slots));
+
+ return node;
+}
+
+static bfm_tree_node_leaf_1 *
+bfm_alloc_leaf_1(bfm_tree *root)
+{
+ bfm_tree_node_leaf_1 *node =
+ (bfm_tree_node_leaf_1 *) bfm_alloc_leaf(root, BFM_KIND_1, sizeof(*node));
+
+ return node;
+}
+
+static bfm_tree_node_leaf_4 *
+bfm_alloc_leaf_4(bfm_tree *root)
+{
+ bfm_tree_node_leaf_4 *node =
+ (bfm_tree_node_leaf_4 *) bfm_alloc_leaf(root, BFM_KIND_4, sizeof(*node));
+
+ return node;
+}
+
+static bfm_tree_node_leaf_16 *
+bfm_alloc_leaf_16(bfm_tree *root)
+{
+ bfm_tree_node_leaf_16 *node =
+ (bfm_tree_node_leaf_16 *) bfm_alloc_leaf(root, BFM_KIND_16, sizeof(*node));
+
+ return node;
+}
+
+static bfm_tree_node_leaf_32 *
+bfm_alloc_leaf_32(bfm_tree *root)
+{
+ bfm_tree_node_leaf_32 *node =
+ (bfm_tree_node_leaf_32 *) bfm_alloc_leaf(root, BFM_KIND_32, sizeof(*node));
+
+ return node;
+}
+
+static bfm_tree_node_leaf_128 *
+bfm_alloc_leaf_128(bfm_tree *root)
+{
+ bfm_tree_node_leaf_128 *node =
+ (bfm_tree_node_leaf_128 *) bfm_alloc_leaf(root, BFM_KIND_128, sizeof(*node));
+
+ memset(node->offsets, BFM_TREE_NODE_128_INVALID, sizeof(node->offsets));
+
+ return node;
+}
+
+static bfm_tree_node_leaf_max *
+bfm_alloc_leaf_max(bfm_tree *root)
+{
+ bfm_tree_node_leaf_max *node =
+ (bfm_tree_node_leaf_max *) bfm_alloc_leaf(root, BFM_KIND_MAX, sizeof(*node));
+
+ memset(node->set, 0, sizeof(node->set));
+
+ return node;
+}
+
+static void
+bfm_free_internal(bfm_tree *root, void *p)
+{
+#if defined(BFM_USE_OS)
+ free(p);
+#else
+ pfree(p);
+#endif
+}
+
+static void
+bfm_free_inner(bfm_tree *root, bfm_tree_node_inner *node)
+{
+ Assert(node->b.node_shift != 0);
+
+#ifdef BFM_STATS
+ root->inner_nodes[node->b.kind]--;
+#endif
+
+ bfm_free_internal(root, node);
+}
+
+static void
+bfm_free_leaf(bfm_tree *root, bfm_tree_node_leaf *node)
+{
+ Assert(node->b.node_shift == 0);
+
+#ifdef BFM_STATS
+ root->leaf_nodes[node->b.kind]--;
+#endif
+
+ bfm_free_internal(root, node);
+}
+
+#define BFM_LEAF_MAX_SET_OFFSET(i) (i / (sizeof(uint8) * BITS_PER_BYTE))
+#define BFM_LEAF_MAX_SET_BIT(i) (UINT64_C(1) << (i & ((sizeof(uint8) * BITS_PER_BYTE)-1)))
+
+static inline bool
+bfm_leaf_max_isset(bfm_tree_node_leaf_max *node_max, uint32 i)
+{
+ return node_max->set[BFM_LEAF_MAX_SET_OFFSET(i)] & BFM_LEAF_MAX_SET_BIT(i);
+}
+
+static inline void
+bfm_leaf_max_set(bfm_tree_node_leaf_max *node_max, uint32 i)
+{
+ node_max->set[BFM_LEAF_MAX_SET_OFFSET(i)] |= BFM_LEAF_MAX_SET_BIT(i);
+}
+
+static inline void
+bfm_leaf_max_unset(bfm_tree_node_leaf_max *node_max, uint32 i)
+{
+ node_max->set[BFM_LEAF_MAX_SET_OFFSET(i)] &= ~BFM_LEAF_MAX_SET_BIT(i);
+}
+
+static uint64
+bfm_maxval_shift(uint32 shift)
+{
+ uint32 maxshift = (sizeof(bfm_key_type) * BITS_PER_BYTE) / BFM_FANOUT * BFM_FANOUT;
+
+ Assert(shift <= maxshift);
+
+ if (shift == maxshift)
+ return UINT64_MAX;
+
+ return (UINT64_C(1) << (shift + BFM_FANOUT)) - 1;
+}
+
+static inline int
+search_chunk_array_4_eq(uint8 *chunks, uint8 match, uint8 count)
+{
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (chunks[i] == match)
+ {
+ index = i;
+ break;
+ }
+ }
+
+ return index;
+}
+
+static inline int
+search_chunk_array_4_le(uint8 *chunks, uint8 match, uint8 count)
+{
+ int index;
+
+ for (index = 0; index < count; index++)
+ if (chunks[index] >= match)
+ break;
+
+ return index;
+}
+
+
+#if defined(__SSE2__)
+#include <emmintrin.h> // x86 SSE intrinsics
+#endif
+
+static inline int
+search_chunk_array_16_eq(uint8 *chunks, uint8 match, uint8 count)
+{
+#if !defined(__SSE2__) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+#endif
+
+#ifdef __SSE2__
+ int index_sse;
+ __m128i spread_chunk = _mm_set1_epi8(match);
+ __m128i haystack = _mm_loadu_si128((__m128i_u*) chunks);
+ __m128i cmp=_mm_cmpeq_epi8(spread_chunk, haystack);
+ uint32_t bitfield=_mm_movemask_epi8(cmp);
+
+ bitfield &= ((1<<count)-1);
+
+ if (bitfield)
+ index_sse = __builtin_ctz(bitfield);
+ else
+ index_sse = -1;
+
+#endif
+
+#if !defined(__SSE2__) || defined(USE_ASSERT_CHECKING)
+ for (int i = 0; i < count; i++)
+ {
+ if (chunks[i] == match)
+ {
+ index = i;
+ break;
+ }
+ }
+
+#if defined(__SSE2__)
+ Assert(index_sse == index);
+#endif
+
+#endif
+
+#if defined(__SSE2__)
+ return index_sse;
+#else
+ return index;
+#endif
+}
+
+/*
+ * This is a bit more complicated than search_chunk_array_16_eq(), because
+ * until recently no unsigned uint8 comparison instruction existed on x86. So
+ * we need to play some trickery using _mm_min_epu8() to effectively get
+ * <=. There never will be any equal elements in the current uses, but that's
+ * what we get here...
+ */
+static inline int
+search_chunk_array_16_le(uint8 *chunks, uint8 match, uint8 count)
+{
+#if !defined(__SSE2__) || defined(USE_ASSERT_CHECKING)
+ int index;
+#endif
+
+#ifdef __SSE2__
+ int index_sse;
+ __m128i spread_chunk = _mm_set1_epi8(match);
+ __m128i haystack = _mm_loadu_si128((__m128i_u*) chunks);
+ __m128i min = _mm_min_epu8(haystack, spread_chunk);
+ __m128i cmp = _mm_cmpeq_epi8(spread_chunk, min);
+ uint32_t bitfield=_mm_movemask_epi8(cmp);
+
+ bitfield &= ((1<<count)-1);
+
+ if (bitfield)
+ index_sse = __builtin_ctz(bitfield);
+ else
+ index_sse = count;
+#endif
+
+#if !defined(__SSE2__) || defined(USE_ASSERT_CHECKING)
+ for (index = 0; index < count; index++)
+ if (chunks[index] >= match)
+ break;
+
+#if defined(__SSE2__)
+ Assert(index_sse == index);
+#endif
+
+#endif
+
+#if defined(__SSE2__)
+ return index_sse;
+#else
+ return index;
+#endif
+}
+
+#if defined(__AVX2__)
+#include <immintrin.h> // x86 SSE intrinsics
+#endif
+
+static inline int
+search_chunk_array_32_eq(uint8 *chunks, uint8 match, uint8 count)
+{
+#if !defined(__AVX2__) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+#endif
+
+#ifdef __AVX2__
+ int index_sse;
+ __m256i spread_chunk = _mm256_set1_epi8(match);
+ __m256i haystack = _mm256_loadu_si256((__m256i_u*) chunks);
+ __m256i cmp= _mm256_cmpeq_epi8(spread_chunk, haystack);
+ uint32_t bitfield = _mm256_movemask_epi8(cmp);
+
+ bitfield &= ((UINT64_C(1)<<count)-1);
+
+ if (bitfield)
+ index_sse = __builtin_ctz(bitfield);
+ else
+ index_sse = -1;
+
+#endif
+
+#if !defined(__AVX2__) || defined(USE_ASSERT_CHECKING)
+ for (int i = 0; i < count; i++)
+ {
+ if (chunks[i] == match)
+ {
+ index = i;
+ break;
+ }
+ }
+
+#if defined(__AVX2__)
+ Assert(index_sse == index);
+#endif
+
+#endif
+
+#if defined(__AVX2__)
+ return index_sse;
+#else
+ return index;
+#endif
+}
+
+/*
+ * This is a bit more complicated than search_chunk_array_32_eq(), because
+ * until recently no unsigned 8-bit comparison instruction existed on x86. So
+ * we need to play some trickery using _mm256_min_epu8() to effectively get
+ * <=. There will never be any equal elements in the current uses, but that's
+ * what we get here...
+ */
+static inline int
+search_chunk_array_32_le(uint8 *chunks, uint8 match, uint8 count)
+{
+#if !defined(__AVX2__) || defined(USE_ASSERT_CHECKING)
+ int index;
+#endif
+
+#ifdef __AVX2__
+ int index_sse;
+ __m256i spread_chunk = _mm256_set1_epi8(match);
+ __m256i haystack = _mm256_loadu_si256((__m256i_u*) chunks);
+ __m256i min = _mm256_min_epu8(haystack, spread_chunk);
+ __m256i cmp=_mm256_cmpeq_epi8(spread_chunk, min);
+ uint32_t bitfield=_mm256_movemask_epi8(cmp);
+
+ bitfield &= ((UINT64_C(1)<<count)-1);
+
+ if (bitfield)
+ index_sse = __builtin_ctz(bitfield);
+ else
+ index_sse = count;
+#endif
+
+#if !defined(__AVX2__) || defined(USE_ASSERT_CHECKING)
+ for (index = 0; index < count; index++)
+ if (chunks[index] >= match)
+ break;
+
+#if defined(__AVX2__)
+ Assert(index_sse == index);
+#endif
+
+#endif
+
+#if defined(__AVX2__)
+ return index_sse;
+#else
+ return index;
+#endif
+}
+
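+/*
+ * Copy the chunk and slot arrays from a smaller inner node into its larger
+ * replacement, and repoint the children's parent pointers at the new node.
+ */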
+static inline void
+chunk_slot_array_grow(uint8 *source_chunks, bfm_tree_node **source_slots,
+ uint8 *target_chunks, bfm_tree_node **target_slots,
+ bfm_tree_node_inner *oldnode, bfm_tree_node_inner *newnode)
+{
+ memcpy(target_chunks, source_chunks, sizeof(source_chunks[0]) * oldnode->b.count);
+ memcpy(target_slots, source_slots, sizeof(source_slots[0]) * oldnode->b.count);
+
+ for (int i = 0; i < oldnode->b.count; i++)
+ {
+ Assert(source_slots[i]->parent == oldnode);
+ source_slots[i]->parent = newnode;
+ }
+}
+
+/*
+ * FIXME: Find a way to deduplicate with bfm_find_one_level_leaf()
+ */
+pg_attribute_always_inline static bfm_tree_node *
+bfm_find_one_level_inner(bfm_tree_node_inner * pg_restrict node, uint8 chunk)
+{
+ bfm_tree_node *slot = NULL;
+
+ Assert(node->b.node_shift != 0); /* is inner node */
+
+ /* tell the compiler it doesn't need a bounds check */
+ if ((bfm_tree_node_kind) node->b.kind > BFM_KIND_MAX)
+ pg_unreachable();
+
+ switch((bfm_tree_node_kind) node->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_inner_1 *node_1 =
+ (bfm_tree_node_inner_1 *) node;
+
+ Assert(node_1->b.b.count <= 1);
+ if (node_1->chunk == chunk)
+ slot = node_1->slot;
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_inner_4 *node_4 =
+ (bfm_tree_node_inner_4 *) node;
+ int index;
+
+ Assert(node_4->b.b.count <= 4);
+ index = search_chunk_array_4_eq(node_4->chunks, chunk, node_4->b.b.count);
+
+ if (index != -1)
+ slot = node_4->slots[index];
+
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_inner_16 *node_16 =
+ (bfm_tree_node_inner_16 *) node;
+ int index;
+
+ Assert(node_16->b.b.count <= 16);
+
+ index = search_chunk_array_16_eq(node_16->chunks, chunk, node_16->b.b.count);
+ if (index != -1)
+ slot = node_16->slots[index];
+
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_inner_32 *node_32 =
+ (bfm_tree_node_inner_32 *) node;
+ int index;
+
+ Assert(node_32->b.b.count <= 32);
+
+ index = search_chunk_array_32_eq(node_32->chunks, chunk, node_32->b.b.count);
+ if (index != -1)
+ slot = node_32->slots[index];
+
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_inner_128 *node_128 =
+ (bfm_tree_node_inner_128 *) node;
+
+ Assert(node_128->b.b.count <= 128);
+
+ if (node_128->offsets[chunk] != BFM_TREE_NODE_128_INVALID)
+ {
+ slot = node_128->slots[node_128->offsets[chunk]];
+ }
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_inner_max *node_max =
+ (bfm_tree_node_inner_max *) node;
+
+ Assert(node_max->b.b.count <= BFM_MAX_CLASS);
+ slot = node_max->slots[chunk];
+
+ break;
+ }
+ }
+
+ return slot;
+}
+
+/*
+ * FIXME: Find a way to deduplicate with bfm_find_one_level_inner()
+ */
+pg_attribute_always_inline static bool
+bfm_find_one_level_leaf(bfm_tree_node_leaf * pg_restrict node, uint8 chunk, bfm_value_type * pg_restrict valp)
+{
+ bool found = false;
+
+ Assert(node->b.node_shift == 0); /* is leaf node */
+
+ /* tell the compiler it doesn't need a bounds check */
+ if ((bfm_tree_node_kind) node->b.kind > BFM_KIND_MAX)
+ pg_unreachable();
+
+ switch((bfm_tree_node_kind) node->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_leaf_1 *node_1 =
+ (bfm_tree_node_leaf_1 *) node;
+
+ Assert(node_1->b.b.count <= 1);
+ if (node_1->b.b.count == 1 &&
+ node_1->chunk == chunk)
+ {
+ *valp = node_1->value;
+ found = true;
+ break;
+ }
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_leaf_4 *node_4 =
+ (bfm_tree_node_leaf_4 *) node;
+ int index;
+
+ Assert(node_4->b.b.count <= 4);
+ index = search_chunk_array_4_eq(node_4->chunks, chunk, node_4->b.b.count);
+
+ if (index != -1)
+ {
+ *valp = node_4->values[index];
+ found = true;
+ }
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_leaf_16 *node_16 =
+ (bfm_tree_node_leaf_16 *) node;
+ int index;
+
+ Assert(node_16->b.b.count <= 16);
+
+ index = search_chunk_array_16_eq(node_16->chunks, chunk, node_16->b.b.count);
+ if (index != -1)
+ {
+ *valp = node_16->values[index];
+ found = true;
+ break;
+ }
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_leaf_32 *node_32 =
+ (bfm_tree_node_leaf_32 *) node;
+ int index;
+
+ Assert(node_32->b.b.count <= 32);
+
+ index = search_chunk_array_32_eq(node_32->chunks, chunk, node_32->b.b.count);
+ if (index != -1)
+ {
+ *valp = node_32->values[index];
+ found = true;
+ break;
+ }
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_leaf_128 *node_128 =
+ (bfm_tree_node_leaf_128 *) node;
+
+ Assert(node_128->b.b.count <= 128);
+
+ if (node_128->offsets[chunk] != BFM_TREE_NODE_128_INVALID)
+ {
+ *valp = node_128->values[node_128->offsets[chunk]];
+ found = true;
+ }
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_leaf_max *node_max =
+ (bfm_tree_node_leaf_max *) node;
+
+ Assert(node_max->b.b.count <= BFM_MAX_CLASS);
+
+ if (bfm_leaf_max_isset(node_max, chunk))
+ {
+ *valp = node_max->values[chunk];
+ found = true;
+ }
+ break;
+ }
+ }
+
+ return found;
+}
+
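+/*
+ * Descend from the root towards the leaf level, consuming BFM_FANOUT bits
+ * of the key per level. Returns true and stores the value in *valp if the
+ * key is present. Otherwise *nodep is set to the lowest node reached, or
+ * NULL if the key cannot be in the tree at all.
+ */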
+pg_attribute_always_inline static bool
+bfm_walk(bfm_tree *root, bfm_tree_node **nodep, bfm_value_type *valp, uint64_t key)
+{
+ bfm_tree_node *rnode;
+ bfm_tree_node *cur;
+ uint8 chunk;
+ uint32 shift;
+
+ rnode = root->rnode;
+
+ /* can't be contained in the tree */
+ if (!rnode || key > root->maxval)
+ {
+ *nodep = NULL;
+ return false;
+ }
+
+ shift = rnode->node_shift;
+ chunk = (key >> shift) & BFM_MASK;
+ cur = rnode;
+
+ while (shift > 0)
+ {
+ bfm_tree_node_inner *cur_inner;
+ bfm_tree_node *slot;
+
+ Assert(cur->node_shift > 0); /* leaf nodes look different */
+ Assert(cur->node_shift == shift);
+
+ cur_inner = (bfm_tree_node_inner *) cur;
+
+ slot = bfm_find_one_level_inner(cur_inner, chunk);
+
+ if (slot == NULL)
+ {
+ *nodep = cur;
+ return false;
+ }
+
+ Assert(&slot->parent->b == cur);
+ Assert(slot->node_chunk == chunk);
+
+ cur = slot;
+ shift -= BFM_FANOUT;
+ chunk = (key >> shift) & BFM_MASK;
+ }
+
+ Assert(cur->node_shift == shift && shift == 0);
+
+ *nodep = cur;
+
+ return bfm_find_one_level_leaf((bfm_tree_node_leaf*) cur, chunk, valp);
+}
+
+/*
+ * Redirect parent pointers to oldnode by newnode, for the key chunk
+ * chunk. Used when growing or shrinking nodes.
+ */
+static void
+bfm_redirect(bfm_tree *root, bfm_tree_node *oldnode, bfm_tree_node *newnode, uint8 chunk)
+{
+ bfm_tree_node_inner *parent = oldnode->parent;
+
+ if (parent == NULL)
+ {
+ Assert(root->rnode == oldnode);
+ root->rnode = newnode;
+ return;
+ }
+
+ /* if there is a parent, it needs to be an inner node */
+ Assert(parent->b.node_shift != 0);
+
+ if ((bfm_tree_node_kind) parent->b.kind > BFM_KIND_MAX)
+ pg_unreachable();
+
+ switch((bfm_tree_node_kind) parent->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_inner_1 *parent_1 =
+ (bfm_tree_node_inner_1 *) parent;
+
+ Assert(parent_1->slot == oldnode);
+ Assert(parent_1->chunk == chunk);
+
+ parent_1->slot = newnode;
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_inner_4 *parent_4 =
+ (bfm_tree_node_inner_4 *) parent;
+ int index;
+
+ Assert(parent_4->b.b.count <= 4);
+ index = search_chunk_array_4_eq(parent_4->chunks, chunk, parent_4->b.b.count);
+ Assert(index != -1);
+
+ Assert(parent_4->slots[index] == oldnode);
+ parent_4->slots[index] = newnode;
+
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_inner_16 *parent_16 =
+ (bfm_tree_node_inner_16 *) parent;
+ int index;
+
+ index = search_chunk_array_16_eq(parent_16->chunks, chunk, parent_16->b.b.count);
+ Assert(index != -1);
+
+ Assert(parent_16->slots[index] == oldnode);
+ parent_16->slots[index] = newnode;
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_inner_32 *parent_32 =
+ (bfm_tree_node_inner_32 *) parent;
+ int index;
+
+ index = search_chunk_array_32_eq(parent_32->chunks, chunk, parent_32->b.b.count);
+ Assert(index != -1);
+
+ Assert(parent_32->slots[index] == oldnode);
+ parent_32->slots[index] = newnode;
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_inner_128 *parent_128 =
+ (bfm_tree_node_inner_128 *) parent;
+ uint8 offset;
+
+ offset = parent_128->offsets[chunk];
+ Assert(offset != BFM_TREE_NODE_128_INVALID);
+ Assert(parent_128->slots[offset] == oldnode);
+ parent_128->slots[offset] = newnode;
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_inner_max *parent_max =
+ (bfm_tree_node_inner_max *) parent;
+
+ Assert(parent_max->slots[chunk] == oldnode);
+ parent_max->slots[chunk] = newnode;
+
+ break;
+ }
+ }
+}
+
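+/*
+ * Copy the common node header (shift, chunk, count, parent) from oldnode to
+ * its newly allocated replacement.
+ */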
+static void
+bfm_node_copy_common(bfm_tree *root, bfm_tree_node *oldnode, bfm_tree_node *newnode)
+{
+ newnode->node_shift = oldnode->node_shift;
+ newnode->node_chunk = oldnode->node_chunk;
+ newnode->count = oldnode->count;
+ newnode->parent = oldnode->parent;
+}
+
+/*
+ * Insert child into node.
+ *
+ * NB: `node` cannot be used after this call anymore, it changes if the node
+ * needs to be grown to fit the insertion.
+ *
+ * FIXME: Find a way to deduplicate with bfm_set_leaf()
+ */
+static void
+bfm_insert_inner(bfm_tree *root, bfm_tree_node_inner *node, bfm_tree_node *child, int child_chunk)
+{
+ Assert(node->b.node_shift != 0); /* is inner node */
+
+ child->node_chunk = child_chunk;
+
+ /* tell the compiler it doesn't need a bounds check */
+ if ((bfm_tree_node_kind) node->b.kind > BFM_KIND_MAX)
+ pg_unreachable();
+
+ switch((bfm_tree_node_kind) node->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_inner_1 *node_1 =
+ (bfm_tree_node_inner_1 *) node;
+
+ Assert(node_1->b.b.count <= 1);
+
+ if (unlikely(node_1->b.b.count == 1))
+ {
+ /* grow node from 1 -> 4 */
+ bfm_tree_node_inner_4 *newnode_4;
+
+ newnode_4 = bfm_alloc_inner_4(root);
+ bfm_node_copy_common(root, &node->b, &newnode_4->b.b);
+
+ Assert(node_1->slot->parent != NULL);
+ Assert(node_1->slot->parent == node);
+ newnode_4->chunks[0] = node_1->chunk;
+ newnode_4->slots[0] = node_1->slot;
+ node_1->slot->parent = &newnode_4->b;
+
+ bfm_redirect(root, &node->b, &newnode_4->b.b, newnode_4->b.b.node_chunk);
+ bfm_free_inner(root, node);
+ node = &newnode_4->b;
+ }
+ else
+ {
+ child->parent = node;
+ node_1->chunk = child_chunk;
+ node_1->slot = child;
+ break;
+ }
+ }
+ /* fallthrough */
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_inner_4 *node_4 =
+ (bfm_tree_node_inner_4 *) node;
+
+ Assert(node_4->b.b.count <= 4);
+ if (unlikely(node_4->b.b.count == 4))
+ {
+ /* grow node from 4 -> 16 */
+ bfm_tree_node_inner_16 *newnode_16;
+
+ newnode_16 = bfm_alloc_inner_16(root);
+ bfm_node_copy_common(root, &node->b, &newnode_16->b.b);
+
+ chunk_slot_array_grow(node_4->chunks, node_4->slots,
+ newnode_16->chunks, newnode_16->slots,
+ &node_4->b, &newnode_16->b);
+
+ bfm_redirect(root, &node->b, &newnode_16->b.b, newnode_16->b.b.node_chunk);
+ bfm_free_inner(root, node);
+ node = &newnode_16->b;
+ }
+ else
+ {
+ int insertpos;
+
+ for (insertpos = 0; insertpos < node_4->b.b.count; insertpos++)
+ if (node_4->chunks[insertpos] >= child_chunk)
+ break;
+
+ child->parent = node;
+
+ memmove(&node_4->slots[insertpos + 1],
+ &node_4->slots[insertpos],
+ (node_4->b.b.count - insertpos) * sizeof(node_4->slots[0]));
+ memmove(&node_4->chunks[insertpos + 1],
+ &node_4->chunks[insertpos],
+ (node_4->b.b.count - insertpos) * sizeof(node_4->chunks[0]));
+
+ node_4->chunks[insertpos] = child_chunk;
+ node_4->slots[insertpos] = child;
+ break;
+ }
+ }
+ /* fallthrough */
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_inner_16 *node_16 =
+ (bfm_tree_node_inner_16 *) node;
+
+ Assert(node_16->b.b.count <= 16);
+ if (unlikely(node_16->b.b.count == 16))
+ {
+ /* grow node from 16 -> 32 */
+ bfm_tree_node_inner_32 *newnode_32;
+
+ newnode_32 = bfm_alloc_inner_32(root);
+ bfm_node_copy_common(root, &node->b, &newnode_32->b.b);
+
+ chunk_slot_array_grow(node_16->chunks, node_16->slots,
+ newnode_32->chunks, newnode_32->slots,
+ &node_16->b, &newnode_32->b);
+
+ bfm_redirect(root, &node->b, &newnode_32->b.b, newnode_32->b.b.node_chunk);
+ bfm_free_inner(root, node);
+ node = &newnode_32->b;
+ }
+ else
+ {
+ int insertpos;
+
+ insertpos = search_chunk_array_16_le(node_16->chunks, child_chunk, node_16->b.b.count);
+
+ child->parent = node;
+
+ memmove(&node_16->slots[insertpos + 1],
+ &node_16->slots[insertpos],
+ (node_16->b.b.count - insertpos) * sizeof(node_16->slots[0]));
+ memmove(&node_16->chunks[insertpos + 1],
+ &node_16->chunks[insertpos],
+ (node_16->b.b.count - insertpos) * sizeof(node_16->chunks[0]));
+
+ node_16->chunks[insertpos] = child_chunk;
+ node_16->slots[insertpos] = child;
+ break;
+ }
+ }
+ /* fallthrough */
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_inner_32 *node_32 =
+ (bfm_tree_node_inner_32 *) node;
+
+ Assert(node_32->b.b.count <= 32);
+ if (unlikely(node_32->b.b.count == 32))
+ {
+ /* grow node from 32 -> 128 */
+ bfm_tree_node_inner_128 *newnode_128;
+
+ newnode_128 = bfm_alloc_inner_128(root);
+ bfm_node_copy_common(root, &node->b, &newnode_128->b.b);
+
+ memcpy(newnode_128->slots, node_32->slots, sizeof(node_32->slots));
+
+ /* change parent pointers of children */
+ for (int i = 0; i < 32; i++)
+ {
+ Assert(node_32->slots[i]->parent == node);
+ newnode_128->offsets[node_32->chunks[i]] = i;
+ node_32->slots[i]->parent = &newnode_128->b;
+ }
+
+ bfm_redirect(root, &node->b, &newnode_128->b.b, newnode_128->b.b.node_chunk);
+ bfm_free_inner(root, node);
+ node = &newnode_128->b;
+ }
+ else
+ {
+ int insertpos;
+
+ insertpos = search_chunk_array_32_le(node_32->chunks, child_chunk, node_32->b.b.count);
+
+ child->parent = node;
+
+ memmove(&node_32->slots[insertpos + 1],
+ &node_32->slots[insertpos],
+ (node_32->b.b.count - insertpos) * sizeof(node_32->slots[0]));
+ memmove(&node_32->chunks[insertpos + 1],
+ &node_32->chunks[insertpos],
+ (node_32->b.b.count - insertpos) * sizeof(node_32->chunks[0]));
+
+ node_32->chunks[insertpos] = child_chunk;
+ node_32->slots[insertpos] = child;
+ break;
+ }
+ }
+ /* fallthrough */
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_inner_128 *node_128 =
+ (bfm_tree_node_inner_128 *) node;
+ uint8 offset;
+
+ Assert(node_128->b.b.count <= 128);
+ if (unlikely(node_128->b.b.count == 128))
+ {
+ /* grow node from 128 -> max */
+ bfm_tree_node_inner_max *newnode_max;
+
+ newnode_max = bfm_alloc_inner_max(root);
+ bfm_node_copy_common(root, &node->b, &newnode_max->b.b);
+
+ for (int i = 0; i < BFM_MAX_CLASS; i++)
+ {
+ uint8 offset = node_128->offsets[i];
+
+ if (offset == BFM_TREE_NODE_128_INVALID)
+ continue;
+
+ Assert(node_128->slots[offset] != NULL);
+ Assert(node_128->slots[offset]->parent == node);
+
+ node_128->slots[offset]->parent = &newnode_max->b;
+
+ newnode_max->slots[i] = node_128->slots[offset];
+ }
+
+ bfm_redirect(root, &node->b, &newnode_max->b.b, newnode_max->b.b.node_chunk);
+ bfm_free_inner(root, node);
+ node = &newnode_max->b;
+ }
+ else
+ {
+ child->parent = node;
+ offset = node_128->b.b.count;
+ /* FIXME: this may overwrite entry if there had been deletions */
+ node_128->offsets[child_chunk] = offset;
+ node_128->slots[offset] = child;
+ break;
+ }
+ }
+ /* fallthrough */
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_inner_max *node_max =
+ (bfm_tree_node_inner_max *) node;
+
+ Assert(node_max->b.b.count <= (BFM_MAX_CLASS - 1));
+ Assert(node_max->slots[child_chunk] == NULL);
+
+ child->parent = node;
+ node_max->slots[child_chunk] = child;
+
+ break;
+ }
+ }
+
+ node->b.count++;
+}
+
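+/*
+ * The bfm_grow_leaf_* helpers replace a full leaf node with the next larger
+ * size class: copy the existing entries over in sorted (or offset/bitmap)
+ * form, insert the new entry, redirect the parent to the new node and free
+ * the old one.
+ */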
+static bool pg_noinline
+bfm_grow_leaf_1(bfm_tree *root, bfm_tree_node_leaf_1 *node_1,
+ int child_chunk, bfm_value_type val)
+{
+ /* grow node from 1 -> 4 */
+ bfm_tree_node_leaf_4 *newnode_4;
+
+ Assert(node_1->b.b.count == 1);
+
+ newnode_4 = bfm_alloc_leaf_4(root);
+ bfm_node_copy_common(root, &node_1->b.b, &newnode_4->b.b);
+
+ /* copy old & insert new value in the right order */
+ if (child_chunk < node_1->chunk)
+ {
+ newnode_4->chunks[0] = child_chunk;
+ newnode_4->values[0] = val;
+ newnode_4->chunks[1] = node_1->chunk;
+ newnode_4->values[1] = node_1->value;
+ }
+ else
+ {
+ newnode_4->chunks[0] = node_1->chunk;
+ newnode_4->values[0] = node_1->value;
+ newnode_4->chunks[1] = child_chunk;
+ newnode_4->values[1] = val;
+ }
+
+ newnode_4->b.b.count++;
+#ifdef BFM_STATS
+ root->entries++;
+#endif
+
+ bfm_redirect(root, &node_1->b.b, &newnode_4->b.b, newnode_4->b.b.node_chunk);
+ bfm_free_leaf(root, &node_1->b);
+
+ return false;
+}
+
+static bool pg_noinline
+bfm_grow_leaf_4(bfm_tree *root, bfm_tree_node_leaf_4 *node_4,
+ int child_chunk, bfm_value_type val)
+{
+ /* grow node from 4 -> 16 */
+ bfm_tree_node_leaf_16 *newnode_16;
+ int insertpos;
+
+ Assert(node_4->b.b.count == 4);
+
+ newnode_16 = bfm_alloc_leaf_16(root);
+ bfm_node_copy_common(root, &node_4->b.b, &newnode_16->b.b);
+
+ insertpos = search_chunk_array_4_le(node_4->chunks, child_chunk, node_4->b.b.count);
+
+ /* first copy old elements ordering before */
+ memcpy(&newnode_16->chunks[0],
+ &node_4->chunks[0],
+ sizeof(node_4->chunks[0]) * insertpos);
+ memcpy(&newnode_16->values[0],
+ &node_4->values[0],
+ sizeof(node_4->values[0]) * insertpos);
+
+ /* then the new element */
+ newnode_16->chunks[insertpos] = child_chunk;
+ newnode_16->values[insertpos] = val;
+
+ /* and lastly the old elements after */
+ memcpy(&newnode_16->chunks[insertpos + 1],
+ &node_4->chunks[insertpos],
+ (node_4->b.b.count-insertpos) * sizeof(node_4->chunks[0]));
+ memcpy(&newnode_16->values[insertpos + 1],
+ &node_4->values[insertpos],
+ (node_4->b.b.count-insertpos) * sizeof(node_4->values[0]));
+
+ newnode_16->b.b.count++;
+#ifdef BFM_STATS
+ root->entries++;
+#endif
+
+ bfm_redirect(root, &node_4->b.b, &newnode_16->b.b, newnode_16->b.b.node_chunk);
+ bfm_free_leaf(root, &node_4->b);
+
+ return false;
+}
+
+static bool pg_noinline
+bfm_grow_leaf_16(bfm_tree *root, bfm_tree_node_leaf_16 *node_16,
+ int child_chunk, bfm_value_type val)
+{
+ /* grow node from 16 -> 32 */
+ bfm_tree_node_leaf_32 *newnode_32;
+ int insertpos;
+
+ Assert(node_16->b.b.count == 16);
+
+ newnode_32 = bfm_alloc_leaf_32(root);
+ bfm_node_copy_common(root, &node_16->b.b, &newnode_32->b.b);
+
+ insertpos = search_chunk_array_16_le(node_16->chunks, child_chunk, node_16->b.b.count);
+
+ /* first copy old elements ordering before */
+ memcpy(&newnode_32->chunks[0],
+ &node_16->chunks[0],
+ sizeof(node_16->chunks[0]) * insertpos);
+ memcpy(&newnode_32->values[0],
+ &node_16->values[0],
+ sizeof(node_16->values[0]) * insertpos);
+
+ /* then the new element */
+ newnode_32->chunks[insertpos] = child_chunk;
+ newnode_32->values[insertpos] = val;
+
+ /* and lastly the old elements after */
+ memcpy(&newnode_32->chunks[insertpos + 1],
+ &node_16->chunks[insertpos],
+ (node_16->b.b.count-insertpos) * sizeof(node_16->chunks[0]));
+ memcpy(&newnode_32->values[insertpos + 1],
+ &node_16->values[insertpos],
+ (node_16->b.b.count-insertpos) * sizeof(node_16->values[0]));
+
+ newnode_32->b.b.count++;
+#ifdef BFM_STATS
+ root->entries++;
+#endif
+
+ bfm_redirect(root, &node_16->b.b, &newnode_32->b.b, newnode_32->b.b.node_chunk);
+ bfm_free_leaf(root, &node_16->b);
+
+ return false;
+}
+
+static bool pg_noinline
+bfm_grow_leaf_32(bfm_tree *root, bfm_tree_node_leaf_32 *node_32,
+ int child_chunk, bfm_value_type val)
+{
+ /* grow node from 32 -> 128 */
+ bfm_tree_node_leaf_128 *newnode_128;
+ uint8 offset;
+
+ newnode_128 = bfm_alloc_leaf_128(root);
+ bfm_node_copy_common(root, &node_32->b.b, &newnode_128->b.b);
+
+ memcpy(newnode_128->values, node_32->values, sizeof(node_32->values));
+
+ for (int i = 0; i < 32; i++)
+ newnode_128->offsets[node_32->chunks[i]] = i;
+
+ offset = newnode_128->b.b.count;
+ newnode_128->offsets[child_chunk] = offset;
+ newnode_128->values[offset] = val;
+
+ newnode_128->b.b.count++;
+#ifdef BFM_STATS
+ root->entries++;
+#endif
+
+ bfm_redirect(root, &node_32->b.b, &newnode_128->b.b, newnode_128->b.b.node_chunk);
+ bfm_free_leaf(root, &node_32->b);
+
+ return false;
+}
+
+static bool pg_noinline
+bfm_grow_leaf_128(bfm_tree *root, bfm_tree_node_leaf_128 *node_128,
+ int child_chunk, bfm_value_type val)
+{
+ /* grow node from 128 -> max */
+ bfm_tree_node_leaf_max *newnode_max;
+ int i;
+
+ newnode_max = bfm_alloc_leaf_max(root);
+ bfm_node_copy_common(root, &node_128->b.b, &newnode_max->b.b);
+
+ /*
+ * The bitmask manipulation is a surprisingly large portion of the
+ * overhead in the naive implementation. Unrolling the bit manipulation
+ * removes a lot of that overhead.
+ */
+ i = 0;
+ for (int byte = 0; byte < BFM_MAX_CLASS / BITS_PER_BYTE; byte++)
+ {
+ uint8 bitmap = 0;
+
+ for (int bit = 0; bit < BITS_PER_BYTE; bit++)
+ {
+ uint8 offset = node_128->offsets[i];
+
+ if (offset != BFM_TREE_NODE_128_INVALID)
+ {
+ bitmap |= 1 << bit;
+ newnode_max->values[i] = node_128->values[offset];
+ }
+
+ i++;
+ }
+
+ newnode_max->set[byte] = bitmap;
+ }
+
+ bfm_leaf_max_set(newnode_max, child_chunk);
+ newnode_max->values[child_chunk] = val;
+ newnode_max->b.b.count++;
+#ifdef BFM_STATS
+ root->entries++;
+#endif
+
+ bfm_redirect(root, &node_128->b.b, &newnode_max->b.b, newnode_max->b.b.node_chunk);
+ bfm_free_leaf(root, &node_128->b);
+
+ return false;
+}
+
+/*
+ * Set key to val. Return false if entry doesn't yet exist, true if it did.
+ *
+ * See comments to bfm_insert_inner().
+ */
+static bool pg_noinline
+bfm_set_leaf(bfm_tree *root, bfm_key_type key, bfm_value_type val,
+ bfm_tree_node_leaf *node, int child_chunk)
+{
+ Assert(node->b.node_shift == 0); /* is leaf node */
+
+ /* tell the compiler it doesn't need a bounds check */
+ if ((bfm_tree_node_kind) node->b.kind > BFM_KIND_MAX)
+ pg_unreachable();
+
+ switch((bfm_tree_node_kind) node->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_leaf_1 *node_1 =
+ (bfm_tree_node_leaf_1 *) node;
+
+ Assert(node_1->b.b.count <= 1);
+
+ if (node_1->b.b.count == 1 &&
+ node_1->chunk == child_chunk)
+ {
+ node_1->value = val;
+ return true;
+ }
+ else if (likely(node_1->b.b.count < 1))
+ {
+ node_1->chunk = child_chunk;
+ node_1->value = val;
+ }
+ else
+ return bfm_grow_leaf_1(root, node_1, child_chunk, val);
+
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_leaf_4 *node_4 =
+ (bfm_tree_node_leaf_4 *) node;
+ int index;
+
+ Assert(node_4->b.b.count <= 4);
+
+ index = search_chunk_array_4_eq(node_4->chunks, child_chunk, node_4->b.b.count);
+ if (index != -1)
+ {
+ node_4->values[index] = val;
+ return true;
+ }
+
+ if (likely(node_4->b.b.count < 4))
+ {
+ int insertpos;
+
+ insertpos = search_chunk_array_4_le(node_4->chunks, child_chunk, node_4->b.b.count);
+
+ for (int i = node_4->b.b.count - 1; i >= insertpos; i--)
+ {
+ /* workaround for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101481 */
+#ifdef __GNUC__
+ __asm__("");
+#endif
+ node_4->values[i + 1] = node_4->values[i];
+ node_4->chunks[i + 1] = node_4->chunks[i];
+ }
+
+ node_4->chunks[insertpos] = child_chunk;
+ node_4->values[insertpos] = val;
+ }
+ else
+ return bfm_grow_leaf_4(root, node_4, child_chunk, val);
+
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_leaf_16 *node_16 =
+ (bfm_tree_node_leaf_16 *) node;
+ int index;
+
+ Assert(node_16->b.b.count <= 16);
+
+ index = search_chunk_array_16_eq(node_16->chunks, child_chunk, node_16->b.b.count);
+ if (index != -1)
+ {
+ node_16->values[index] = val;
+ return true;
+ }
+
+ if (likely(node_16->b.b.count < 16))
+ {
+ int insertpos;
+
+ insertpos = search_chunk_array_16_le(node_16->chunks, child_chunk, node_16->b.b.count);
+
+ if (node_16->b.b.count > 16 || insertpos > 15)
+ pg_unreachable();
+
+ for (int i = node_16->b.b.count - 1; i >= insertpos; i--)
+ {
+ /* workaround for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101481 */
+#ifdef __GNUC__
+ __asm__("");
+#endif
+ node_16->values[i + 1] = node_16->values[i];
+ node_16->chunks[i + 1] = node_16->chunks[i];
+ }
+ node_16->chunks[insertpos] = child_chunk;
+ node_16->values[insertpos] = val;
+ }
+ else
+ return bfm_grow_leaf_16(root, node_16, child_chunk, val);
+
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_leaf_32 *node_32 =
+ (bfm_tree_node_leaf_32 *) node;
+ int index;
+
+ Assert(node_32->b.b.count <= 32);
+
+ index = search_chunk_array_32_eq(node_32->chunks, child_chunk, node_32->b.b.count);
+ if (index != -1)
+ {
+ node_32->values[index] = val;
+ return true;
+ }
+
+ if (likely(node_32->b.b.count < 32))
+ {
+ int insertpos;
+
+ insertpos = search_chunk_array_32_le(node_32->chunks, child_chunk, node_32->b.b.count);
+
+ if (node_32->b.b.count > 32 || insertpos > 31)
+ pg_unreachable();
+
+ for (int i = node_32->b.b.count - 1; i >= insertpos; i--)
+ {
+ /* workaround for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101481 */
+#ifdef __GNUC__
+ __asm__("");
+#endif
+ node_32->values[i + 1] = node_32->values[i];
+ node_32->chunks[i + 1] = node_32->chunks[i];
+ }
+ node_32->chunks[insertpos] = child_chunk;
+ node_32->values[insertpos] = val;
+ }
+ else
+ return bfm_grow_leaf_32(root, node_32, child_chunk, val);
+
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_leaf_128 *node_128 =
+ (bfm_tree_node_leaf_128 *) node;
+ uint8 offset;
+
+ Assert(node_128->b.b.count <= 128);
+
+ if (node_128->offsets[child_chunk] != BFM_TREE_NODE_128_INVALID)
+ {
+ offset = node_128->offsets[child_chunk];
+ node_128->values[offset] = val;
+
+ return true;
+ }
+ else if (likely(node_128->b.b.count < 128))
+ {
+ offset = node_128->b.b.count;
+ node_128->offsets[child_chunk] = offset;
+ node_128->values[offset] = val;
+ }
+ else
+ return bfm_grow_leaf_128(root, node_128, child_chunk, val);
+
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_leaf_max *node_max =
+ (bfm_tree_node_leaf_max *) node;
+
+ Assert(node_max->b.b.count <= (BFM_MAX_CLASS - 1));
+
+ if (bfm_leaf_max_isset(node_max, child_chunk))
+ {
+ node_max->values[child_chunk] = val;
+ return true;
+ }
+
+ bfm_leaf_max_set(node_max, child_chunk);
+ node_max->values[child_chunk] = val;
+
+ break;
+ }
+ }
+
+ node->b.count++;
+#ifdef BFM_STATS
+ root->entries++;
+#endif
+
+ return false;
+}
+
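+/*
+ * Create the missing path from cur_inner down to the leaf level for key: a
+ * chain of single-child inner nodes followed by a single-entry leaf holding
+ * val. Used when an insertion reaches an inner node that has no child for
+ * the relevant chunk.
+ */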
+static bool pg_noinline
+bfm_set_extend(bfm_tree *root, bfm_key_type key, bfm_value_type val,
+ bfm_tree_node_inner *cur_inner,
+ uint32 shift, uint8 chunk)
+{
+ bfm_tree_node_leaf_1 *new_leaf_1;
+
+ while (shift > BFM_FANOUT)
+ {
+ bfm_tree_node_inner_1 *new_inner_1;
+
+ Assert(shift == cur_inner->b.node_shift);
+
+ new_inner_1 = bfm_alloc_inner_1(root);
+ new_inner_1->b.b.node_shift = shift - BFM_FANOUT;
+
+ bfm_insert_inner(root, cur_inner, &new_inner_1->b.b, chunk);
+
+ shift -= BFM_FANOUT;
+ chunk = (key >> shift) & BFM_MASK;
+ cur_inner = &new_inner_1->b;
+ }
+
+ Assert(shift == BFM_FANOUT && cur_inner->b.node_shift == BFM_FANOUT);
+
+ new_leaf_1 = bfm_alloc_leaf_1(root);
+ new_leaf_1->b.b.count = 1;
+ new_leaf_1->b.b.node_shift = 0;
+
+ new_leaf_1->chunk = key & BFM_MASK;
+ new_leaf_1->value = val;
+
+#ifdef BFM_STATS
+ root->entries++;
+#endif
+
+ bfm_insert_inner(root, cur_inner, &new_leaf_1->b.b, chunk);
+
+ return false;
+}
+
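+/*
+ * First insertion into an empty tree: create a root node just tall enough
+ * for key, then insert via bfm_set_leaf() or bfm_set_extend().
+ */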
+static bool pg_noinline
+bfm_set_empty(bfm_tree *root, bfm_key_type key, bfm_value_type val)
+{
+ uint32 shift;
+
+ Assert(root->rnode == NULL);
+
+ if (key == 0)
+ shift = 0;
+ else
+ shift = (pg_leftmost_one_pos64(key)/BFM_FANOUT)*BFM_FANOUT;
+
+ if (shift == 0)
+ {
+ bfm_tree_node_leaf_1 *nroot = bfm_alloc_leaf_1(root);
+
+ Assert((key & BFM_MASK) == key);
+
+ nroot->b.b.node_shift = 0;
+ nroot->b.b.node_chunk = 0;
+ nroot->b.b.parent = NULL;
+
+ root->maxval = bfm_maxval_shift(0);
+
+ root->rnode = &nroot->b.b;
+
+ return bfm_set_leaf(root, key, val, &nroot->b, key);
+ }
+ else
+ {
+ bfm_tree_node_inner_1 *nroot = bfm_alloc_inner_1(root);
+
+ nroot->b.b.node_shift = shift;
+ nroot->b.b.node_chunk = 0;
+ nroot->b.b.parent = NULL;
+
+ root->maxval = bfm_maxval_shift(shift);
+ root->rnode = &nroot->b.b;
+
+
+ return bfm_set_extend(root, key, val, &nroot->b,
+ shift, (key >> shift) & BFM_MASK);
+ }
+}
+
+/*
+ * Tree doesn't have sufficient height. Put new tree node(s) on top, move
+ * the old node below it, and then insert.
+ */
+static bool pg_noinline
+bfm_set_shallow(bfm_tree *root, bfm_key_type key, bfm_value_type val)
+{
+ uint32 shift;
+ bfm_tree_node_inner_1 *nroot = NULL;
+
+ Assert(root->rnode != NULL);
+
+ if (key == 0)
+ shift = 0;
+ else
+ shift = (pg_leftmost_one_pos64(key)/BFM_FANOUT)*BFM_FANOUT;
+
+ Assert(root->rnode->node_shift < shift);
+
+ while (unlikely(root->rnode->node_shift < shift))
+ {
+ nroot = bfm_alloc_inner_1(root);
+
+ nroot->slot = root->rnode;
+ nroot->chunk = 0;
+ nroot->b.b.count = 1;
+ nroot->b.b.parent = NULL;
+ nroot->b.b.node_shift = root->rnode->node_shift + BFM_FANOUT;
+
+ root->rnode->parent = &nroot->b;
+ root->rnode = &nroot->b.b;
+
+ root->maxval = bfm_maxval_shift(nroot->b.b.node_shift);
+ }
+
+ Assert(nroot != NULL);
+
+ return bfm_set_extend(root, key, val, &nroot->b,
+ shift, (key >> shift) & BFM_MASK);
+}
+
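+/*
+ * Remove the entry for child_chunk from an inner node. If the node becomes
+ * empty it is freed and the deletion recurses into its parent (or the tree
+ * becomes empty if it was the root).
+ */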
+static void
+bfm_delete_inner(bfm_tree * pg_restrict root, bfm_tree_node_inner * pg_restrict node, bfm_tree_node *pg_restrict child, int child_chunk)
+{
+ switch((bfm_tree_node_kind) node->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_inner_1 *node_1 =
+ (bfm_tree_node_inner_1 *) node;
+
+ Assert(node_1->slot == child);
+ Assert(node_1->chunk == child_chunk);
+
+ node_1->chunk = 17;
+ node_1->slot = NULL;
+
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_inner_4 *node_4 =
+ (bfm_tree_node_inner_4 *) node;
+ int index;
+
+ index = search_chunk_array_4_eq(node_4->chunks, child_chunk, node_4->b.b.count);
+ Assert(index != -1);
+
+ Assert(node_4->slots[index] == child);
+ memmove(&node_4->slots[index],
+ &node_4->slots[index + 1],
+ (node_4->b.b.count-index-1) * sizeof(void*));
+ memmove(&node_4->chunks[index],
+ &node_4->chunks[index + 1],
+ node_4->b.b.count-index-1);
+
+ node_4->chunks[node_4->b.b.count - 1] = BFM_TREE_NODE_INNER_4_INVALID;
+ node_4->slots[node_4->b.b.count - 1] = NULL;
+
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_inner_16 *node_16 =
+ (bfm_tree_node_inner_16 *) node;
+ int index;
+
+ index = search_chunk_array_16_eq(node_16->chunks, child_chunk, node_16->b.b.count);
+ Assert(index != -1);
+
+ Assert(node_16->slots[index] == child);
+ memmove(&node_16->slots[index],
+ &node_16->slots[index + 1],
+ (node_16->b.b.count - index - 1) * sizeof(node_16->slots[0]));
+ memmove(&node_16->chunks[index],
+ &node_16->chunks[index + 1],
+ (node_16->b.b.count - index - 1) * sizeof(node_16->chunks[0]));
+
+ node_16->chunks[node_16->b.b.count - 1] = BFM_TREE_NODE_INNER_16_INVALID;
+ node_16->slots[node_16->b.b.count - 1] = NULL;
+
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_inner_32 *node_32 =
+ (bfm_tree_node_inner_32 *) node;
+ int index;
+
+ index = search_chunk_array_32_eq(node_32->chunks, child_chunk, node_32->b.b.count);
+ Assert(index != -1);
+
+ Assert(node_32->slots[index] == child);
+ memmove(&node_32->slots[index],
+ &node_32->slots[index + 1],
+ (node_32->b.b.count - index - 1) * sizeof(node_32->slots[0]));
+ memmove(&node_32->chunks[index],
+ &node_32->chunks[index + 1],
+ (node_32->b.b.count - index - 1) * sizeof(node_32->chunks[0]));
+
+ node_32->chunks[node_32->b.b.count - 1] = BFM_TREE_NODE_INNER_32_INVALID;
+ node_32->slots[node_32->b.b.count - 1] = NULL;
+
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_inner_128 *node_128 =
+ (bfm_tree_node_inner_128 *) node;
+ uint8 offset;
+
+ offset = node_128->offsets[child_chunk];
+ Assert(offset != BFM_TREE_NODE_128_INVALID);
+ Assert(node_128->slots[offset] == child);
+ node_128->offsets[child_chunk] = BFM_TREE_NODE_128_INVALID;
+ node_128->slots[offset] = NULL;
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_inner_max *node_max =
+ (bfm_tree_node_inner_max *) node;
+
+ Assert(node_max->slots[child_chunk] == child);
+ node_max->slots[child_chunk] = NULL;
+
+ break;
+ }
+ }
+
+ node->b.count--;
+
+ if (node->b.count == 0)
+ {
+ if (node->b.parent)
+ bfm_delete_inner(root, node->b.parent, &node->b, node->b.node_chunk);
+ else
+ root->rnode = NULL;
+ bfm_free_inner(root, node);
+ }
+}
+
+/*
+ * NB: After this call node cannot be used anymore, it may have been freed or
+ * shrunk.
+ *
+ * FIXME: this should implement shrinking of nodes
+ */
+static void pg_noinline
+bfm_delete_leaf(bfm_tree * pg_restrict root, bfm_tree_node_leaf *pg_restrict node, int child_chunk)
+{
+ /* tell the compiler it doesn't need a bounds check */
+ if ((bfm_tree_node_kind) node->b.kind > BFM_KIND_MAX)
+ pg_unreachable();
+
+ switch((bfm_tree_node_kind) node->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_leaf_1 *node_1 =
+ (bfm_tree_node_leaf_1 *) node;
+
+ Assert(node_1->chunk == child_chunk);
+
+ node_1->chunk = 17;
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_leaf_4 *node_4 =
+ (bfm_tree_node_leaf_4 *) node;
+ int index;
+
+ index = search_chunk_array_4_eq(node_4->chunks, child_chunk, node_4->b.b.count);
+ Assert(index != -1);
+
+ memmove(&node_4->values[index],
+ &node_4->values[index + 1],
+ (node_4->b.b.count - index - 1) * sizeof(node_4->values[0]));
+ memmove(&node_4->chunks[index],
+ &node_4->chunks[index + 1],
+ (node_4->b.b.count - index - 1) * sizeof(node_4->chunks[0]));
+
+ node_4->chunks[node_4->b.b.count - 1] = BFM_TREE_NODE_INNER_4_INVALID;
+ node_4->values[node_4->b.b.count - 1] = 0xFF;
+
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_leaf_16 *node_16 =
+ (bfm_tree_node_leaf_16 *) node;
+ int index;
+
+ index = search_chunk_array_16_eq(node_16->chunks, child_chunk, node_16->b.b.count);
+ Assert(index != -1);
+
+ memmove(&node_16->values[index],
+ &node_16->values[index + 1],
+ (node_16->b.b.count - index - 1) * sizeof(node_16->values[0]));
+ memmove(&node_16->chunks[index],
+ &node_16->chunks[index + 1],
+ (node_16->b.b.count - index - 1) * sizeof(node_16->chunks[0]));
+
+ node_16->chunks[node_16->b.b.count - 1] = BFM_TREE_NODE_INNER_16_INVALID;
+ node_16->values[node_16->b.b.count - 1] = 0xFF;
+
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_leaf_32 *node_32 =
+ (bfm_tree_node_leaf_32 *) node;
+ int index;
+
+ index = search_chunk_array_32_eq(node_32->chunks, child_chunk, node_32->b.b.count);
+ Assert(index != -1);
+
+ memmove(&node_32->values[index],
+ &node_32->values[index + 1],
+ (node_32->b.b.count - index - 1) * sizeof(node_32->values[0]));
+ memmove(&node_32->chunks[index],
+ &node_32->chunks[index + 1],
+ (node_32->b.b.count - index - 1) * sizeof(node_32->chunks[0]));
+
+ node_32->chunks[node_32->b.b.count - 1] = BFM_TREE_NODE_INNER_32_INVALID;
+ node_32->values[node_32->b.b.count - 1] = 0xFF;
+
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_leaf_128 *node_128 =
+ (bfm_tree_node_leaf_128 *) node;
+
+ Assert(node_128->offsets[child_chunk] != BFM_TREE_NODE_128_INVALID);
+ node_128->offsets[child_chunk] = BFM_TREE_NODE_128_INVALID;
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_leaf_max *node_max =
+ (bfm_tree_node_leaf_max *) node;
+
+ Assert(bfm_leaf_max_isset(node_max, child_chunk));
+ bfm_leaf_max_unset(node_max, child_chunk);
+
+ break;
+ }
+ }
+
+#ifdef BFM_STATS
+ root->entries--;
+#endif
+ node->b.count--;
+
+ if (node->b.count == 0)
+ {
+ if (node->b.parent)
+ bfm_delete_inner(root, node->b.parent, &node->b, node->b.node_chunk);
+ else
+ root->rnode = NULL;
+ bfm_free_leaf(root, node);
+ }
+}
+
+void
+bfm_init(bfm_tree *root)
+{
+ memset(root, 0, sizeof(*root));
+
+#if 1
+ root->context = AllocSetContextCreate(CurrentMemoryContext, "radix bench internal",
+ ALLOCSET_DEFAULT_SIZES);
+#else
+ root->context = CurrentMemoryContext;
+#endif
+
+#ifdef BFM_USE_SLAB
+ for (int i = 0; i < BFM_KIND_COUNT; i++)
+ {
+ root->inner_slabs[i] = SlabContextCreate(root->context,
+ inner_class_info[i].name,
+ Max(pg_nextpower2_32((MAXALIGN(inner_class_info[i].size) + 16) * 32), 1024),
+ inner_class_info[i].size);
+ root->leaf_slabs[i] = SlabContextCreate(root->context,
+ leaf_class_info[i].name,
+ Max(pg_nextpower2_32((MAXALIGN(leaf_class_info[i].size) + 16) * 32), 1024),
+ leaf_class_info[i].size);
+#if 0
+ elog(LOG, "%s %s size original %zu, mult %zu, round %u",
+ "leaf",
+ leaf_class_info[i].name,
+ leaf_class_info[i].size,
+ leaf_class_info[i].size * 32,
+ pg_nextpower2_32(leaf_class_info[i].size * 32));
+#endif
+ }
+#endif
+
+ /*
+ * XXX: Might be worth to always allocate a root node, to avoid related
+ * branches?
+ */
+}
+
+bool
+bfm_lookup(bfm_tree *root, uint64_t key, bfm_value_type *val)
+{
+ bfm_tree_node *node;
+
+ return bfm_walk(root, &node, val, key);
+}
+
+/*
+ * Set key to val. Returns false if entry doesn't yet exist, true if it did.
+ */
+bool
+bfm_set(bfm_tree *root, bfm_key_type key, bfm_value_type val)
+{
+ bfm_tree_node *cur;
+ bfm_tree_node_leaf *target;
+ uint8 chunk;
+ uint32 shift;
+
+ if (unlikely(!root->rnode))
+ return bfm_set_empty(root, key, val);
+ else if (key > root->maxval)
+ return bfm_set_shallow(root, key, val);
+
+ shift = root->rnode->node_shift;
+ chunk = (key >> shift) & BFM_MASK;
+ cur = root->rnode;
+
+ while (shift > 0)
+ {
+ bfm_tree_node_inner *cur_inner;
+ bfm_tree_node *slot;
+
+ Assert(cur->node_shift > 0); /* leaf nodes look different */
+ Assert(cur->node_shift == shift);
+
+ cur_inner = (bfm_tree_node_inner *) cur;
+
+ slot = bfm_find_one_level_inner(cur_inner, chunk);
+
+ if (slot == NULL)
+ return bfm_set_extend(root, key, val, cur_inner, shift, chunk);
+
+ Assert(&slot->parent->b == cur);
+ Assert(slot->node_chunk == chunk);
+
+ cur = slot;
+ shift -= BFM_FANOUT;
+ chunk = (key >> shift) & BFM_MASK;
+ }
+
+ Assert(shift == 0 && cur->node_shift == 0);
+
+ target = (bfm_tree_node_leaf *) cur;
+
+ /*
+ * FIXME: what is the best API to deal with existing values? Overwrite?
+ * Overwrite and return old value? Just return true?
+ */
+ return bfm_set_leaf(root, key, val, target, chunk);
+}
+
+bool
+bfm_delete(bfm_tree *root, uint64 key)
+{
+ bfm_tree_node *node;
+ bfm_value_type val;
+
+ if (!bfm_walk(root, &node, &val, key))
+ return false;
+
+ Assert(node != NULL && node->node_shift == 0);
+
+ /* recurses upwards and deletes parent nodes if necessary */
+ bfm_delete_leaf(root, (bfm_tree_node_leaf *) node, key & BFM_MASK);
+
+ return true;
+}
+
+
+StringInfo
+bfm_stats(bfm_tree *root)
+{
+ StringInfo s;
+#ifdef BFM_STATS
+ size_t total;
+ size_t inner_bytes;
+ size_t leaf_bytes;
+ size_t allocator_bytes;
+#endif
+
+ s = makeStringInfo();
+
+ /* FIXME: Some of the below could be printed even without BFM_STATS */
+#ifdef BFM_STATS
+ appendStringInfo(s, "%zu entries and depth %d\n",
+ root->entries,
+ root->rnode ? root->rnode->node_shift / BFM_FANOUT : 0);
+
+ {
+ appendStringInfo(s, "\tinner nodes:");
+ total = 0;
+ inner_bytes = 0;
+ for (int i = 0; i < BFM_KIND_COUNT; i++)
+ {
+ total += root->inner_nodes[i];
+ inner_bytes += inner_class_info[i].size * root->inner_nodes[i];
+ appendStringInfo(s, " %s: %zu, ",
+ inner_class_info[i].name,
+ root->inner_nodes[i]);
+ }
+ appendStringInfo(s, " total: %zu, total_bytes: %zu\n", total,
+ inner_bytes);
+ }
+
+ {
+ appendStringInfo(s, "\tleaf nodes:");
+ total = 0;
+ leaf_bytes = 0;
+ for (int i = 0; i < BFM_KIND_COUNT; i++)
+ {
+ total += root->leaf_nodes[i];
+ leaf_bytes += leaf_class_info[i].size * root->leaf_nodes[i];
+ appendStringInfo(s, " %s: %zu, ",
+ leaf_class_info[i].name,
+ root->leaf_nodes[i]);
+ }
+ appendStringInfo(s, " total: %zu, total_bytes: %zu\n", total,
+ leaf_bytes);
+ }
+
+ allocator_bytes = MemoryContextMemAllocated(root->context, true);
+
+ appendStringInfo(s, "\t%.2f MB excluding allocator overhead, %.2f MiB including\n",
+ (inner_bytes + leaf_bytes) / (double) (1024 * 1024),
+ allocator_bytes / (double) (1024 * 1024));
+ appendStringInfo(s, "\t%.2f bytes/entry excluding allocator overhead\n",
+ root->entries > 0 ?
+ (inner_bytes + leaf_bytes)/(double)root->entries : 0);
+ appendStringInfo(s, "\t%.2f bytes/entry including allocator overhead\n",
+ root->entries > 0 ?
+ allocator_bytes/(double)root->entries : 0);
+#endif
+
+ if (0)
+ bfm_print(root);
+
+ return s;
+}
+
+static void
+bfm_print_node(StringInfo s, int indent, bfm_value_type key, bfm_tree_node *node);
+
+static void
+bfm_print_node_child(StringInfo s, int indent, bfm_value_type key, bfm_tree_node *node,
+ int i, uint8 chunk, bfm_tree_node *child)
+{
+ appendStringInfoSpaces(s, indent + 2);
+ appendStringInfo(s, "%u: child chunk: 0x%.2X, child: %p\n",
+ i, chunk, child);
+ key |= ((uint64) chunk) << node->node_shift;
+
+ bfm_print_node(s, indent + 4, key, child);
+}
+
+static void
+bfm_print_value(StringInfo s, int indent, bfm_value_type key, bfm_tree_node *node,
+ int i, uint8 chunk, bfm_value_type value)
+{
+ key |= chunk;
+
+ appendStringInfoSpaces(s, indent + 2);
+ appendStringInfo(s, "%u: chunk: 0x%.2X, key: 0x%llX/%llu, value: 0x%llX/%llu\n",
+ i,
+ chunk,
+ (unsigned long long) key,
+ (unsigned long long) key,
+ (unsigned long long) value,
+ (unsigned long long) value);
+}
+
+static void
+bfm_print_node(StringInfo s, int indent, bfm_value_type key, bfm_tree_node *node)
+{
+ appendStringInfoSpaces(s, indent);
+ appendStringInfo(s, "%s: kind %d, children: %u, shift: %u, node chunk: 0x%.2X, partial key: 0x%llX\n",
+ node->node_shift != 0 ? "inner" : "leaf",
+ node->kind,
+ node->count,
+ node->node_shift,
+ node->node_chunk,
+ (long long unsigned) key);
+
+ if (node->node_shift != 0)
+ {
+ bfm_tree_node_inner *inner = (bfm_tree_node_inner *) node;
+
+ switch((bfm_tree_node_kind) inner->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_inner_1 *node_1 =
+ (bfm_tree_node_inner_1 *) node;
+
+ if (node_1->b.b.count > 0)
+ bfm_print_node_child(s, indent, key, node,
+ 0, node_1->chunk, node_1->slot);
+
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_inner_4 *node_4 =
+ (bfm_tree_node_inner_4 *) node;
+
+ for (int i = 0; i < node_4->b.b.count; i++)
+ {
+ bfm_print_node_child(s, indent, key, node,
+ i, node_4->chunks[i], node_4->slots[i]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_inner_16 *node_16 =
+ (bfm_tree_node_inner_16 *) node;
+
+ for (int i = 0; i < node_16->b.b.count; i++)
+ {
+ bfm_print_node_child(s, indent, key, node,
+ i, node_16->chunks[i], node_16->slots[i]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_inner_32 *node_32 =
+ (bfm_tree_node_inner_32 *) node;
+
+ for (int i = 0; i < node_32->b.b.count; i++)
+ {
+ bfm_print_node_child(s, indent, key, node,
+ i, node_32->chunks[i], node_32->slots[i]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_inner_128 *node_128 =
+ (bfm_tree_node_inner_128 *) node;
+
+ for (int i = 0; i < BFM_MAX_CLASS; i++)
+ {
+ uint8 offset = node_128->offsets[i];
+
+ if (offset == BFM_TREE_NODE_128_INVALID)
+ continue;
+
+ bfm_print_node_child(s, indent, key, node,
+ offset, i, node_128->slots[offset]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_inner_max *node_max =
+ (bfm_tree_node_inner_max *) node;
+
+ for (int i = 0; i < BFM_MAX_CLASS; i++)
+ {
+ if (node_max->slots[i] == NULL)
+ continue;
+
+ bfm_print_node_child(s, indent, key, node,
+ i, i, node_max->slots[i]);
+ }
+
+ break;
+ }
+ }
+ }
+ else
+ {
+ bfm_tree_node_leaf *leaf = (bfm_tree_node_leaf *) node;
+
+ switch((bfm_tree_node_kind) leaf->b.kind)
+ {
+ case BFM_KIND_1:
+ {
+ bfm_tree_node_leaf_1 *node_1 =
+ (bfm_tree_node_leaf_1 *) node;
+
+ if (node_1->b.b.count > 0)
+ bfm_print_value(s, indent, key, node,
+ 0, node_1->chunk, node_1->value);
+
+ break;
+ }
+
+ case BFM_KIND_4:
+ {
+ bfm_tree_node_leaf_4 *node_4 =
+ (bfm_tree_node_leaf_4 *) node;
+
+ for (int i = 0; i < node_4->b.b.count; i++)
+ {
+ bfm_print_value(s, indent, key, node,
+ i, node_4->chunks[i], node_4->values[i]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_16:
+ {
+ bfm_tree_node_leaf_16 *node_16 =
+ (bfm_tree_node_leaf_16 *) node;
+
+ for (int i = 0; i < node_16->b.b.count; i++)
+ {
+ bfm_print_value(s, indent, key, node,
+ i, node_16->chunks[i], node_16->values[i]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_32:
+ {
+ bfm_tree_node_leaf_32 *node_32 =
+ (bfm_tree_node_leaf_32 *) node;
+
+ for (int i = 0; i < node_32->b.b.count; i++)
+ {
+ bfm_print_value(s, indent, key, node,
+ i, node_32->chunks[i], node_32->values[i]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_128:
+ {
+ bfm_tree_node_leaf_128 *node_128 =
+ (bfm_tree_node_leaf_128 *) node;
+
+ for (int i = 0; i < BFM_MAX_CLASS; i++)
+ {
+ uint8 offset = node_128->offsets[i];
+
+ if (offset == BFM_TREE_NODE_128_INVALID)
+ continue;
+
+ bfm_print_value(s, indent, key, node,
+ offset, i, node_128->values[offset]);
+ }
+
+ break;
+ }
+
+ case BFM_KIND_MAX:
+ {
+ bfm_tree_node_leaf_max *node_max =
+ (bfm_tree_node_leaf_max *) node;
+
+ for (int i = 0; i < BFM_MAX_CLASS; i++)
+ {
+ if (!bfm_leaf_max_isset(node_max, i))
+ continue;
+
+ bfm_print_value(s, indent, key, node,
+ i, i, node_max->values[i]);
+ }
+
+ break;
+ }
+ }
+ }
+}
+
+void
+bfm_print(bfm_tree *root)
+{
+ StringInfoData s;
+
+ initStringInfo(&s);
+
+ if (root->rnode)
+ bfm_print_node(&s, 0 /* indent */, 0 /* key */, root->rnode);
+
+ elog(LOG, "radix debug print:\n%s", s.data);
+ pfree(s.data);
+}
+
+
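+/*
+ * Minimal test harness: the EXPECT_* macros ERROR out with file and line
+ * information on the first failed check.
+ */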
+#define EXPECT_TRUE(expr) \
+ do { \
+ if (!(expr)) \
+ elog(ERROR, \
+ "%s was unexpectedly false in file \"%s\" line %u", \
+ #expr, __FILE__, __LINE__); \
+ } while (0)
+
+#define EXPECT_FALSE(expr) \
+ do { \
+ if (expr) \
+ elog(ERROR, \
+ "%s was unexpectedly true in file \"%s\" line %u", \
+ #expr, __FILE__, __LINE__); \
+ } while (0)
+
+#define EXPECT_EQ_U32(result_expr, expected_expr) \
+ do { \
+ uint32 result = (result_expr); \
+ uint32 expected = (expected_expr); \
+ if (result != expected) \
+ elog(ERROR, \
+ "%s yielded %u, expected %s in file \"%s\" line %u", \
+ #result_expr, result, #expected_expr, __FILE__, __LINE__); \
+ } while (0)
+
+static void
+bfm_test_insert_leaf_grow(bfm_tree *root)
+{
+ bfm_value_type val;
+
+ /* 0->1 */
+ EXPECT_FALSE(bfm_set(root, 0, 0+3));
+ EXPECT_TRUE(bfm_lookup(root, 0, &val));
+ EXPECT_EQ_U32(val, 0+3);
+
+ /* node 1->4 */
+ for (int i = 1; i < 4; i++)
+ {
+ EXPECT_FALSE(bfm_set(root, i, i+3));
+ }
+ for (int i = 0; i < 4; i++)
+ {
+ EXPECT_TRUE(bfm_lookup(root, i, &val));
+ EXPECT_EQ_U32(val, i+3);
+ }
+
+ /* node 4->16, reverse order, for giggles */
+ for (int i = 15; i >= 4; i--)
+ {
+ EXPECT_FALSE(bfm_set(root, i, i+3));
+ }
+ for (int i = 0; i < 16; i++)
+ {
+ EXPECT_TRUE(bfm_lookup(root, i, &val));
+ EXPECT_EQ_U32(val, i+3);
+ }
+
+ /* node 16->32 */
+ for (int i = 16; i < 32; i++)
+ {
+ EXPECT_FALSE(bfm_set(root, i, i+3));
+ }
+ for (int i = 0; i < 32; i++)
+ {
+ EXPECT_TRUE(bfm_lookup(root, i, &val));
+ EXPECT_EQ_U32(val, i+3);
+ }
+
+ /* node 32->128 */
+ for (int i = 32; i < 128; i++)
+ {
+ EXPECT_FALSE(bfm_set(root, i, i+3));
+ }
+ for (int i = 0; i < 128; i++)
+ {
+ EXPECT_TRUE(bfm_lookup(root, i, &val));
+ EXPECT_EQ_U32(val, i+3);
+ }
+
+ /* node 128->max */
+ for (int i = 128; i < BFM_MAX_CLASS; i++)
+ {
+ EXPECT_FALSE(bfm_set(root, i, i+3));
+ }
+ for (int i = 0; i < BFM_MAX_CLASS; i++)
+ {
+ EXPECT_TRUE(bfm_lookup(root, i, &val));
+ EXPECT_EQ_U32(val, i+3);
+ }
+
+}
+
+static void
+bfm_test_insert_inner_grow(void)
+{
+ bfm_tree root;
+ bfm_value_type val;
+ bfm_value_type cur;
+
+ bfm_init(&root);
+
+ cur = 1025;
+
+ while (!root.rnode ||
+ root.rnode->node_shift == 0 ||
+ root.rnode->count < 4)
+ {
+ EXPECT_FALSE(bfm_set(&root, cur, -cur));
+ cur += BFM_MAX_CLASS;
+ }
+
+ for (int i = 1025; i < cur; i += BFM_MAX_CLASS)
+ {
+ EXPECT_TRUE(bfm_lookup(&root, i, &val));
+ EXPECT_EQ_U32(val, -i);
+ }
+
+ while (root.rnode->count < 32)
+ {
+ EXPECT_FALSE(bfm_set(&root, cur, -cur));
+ cur += BFM_MAX_CLASS;
+ }
+
+ for (int i = 1025; i < cur; i += BFM_MAX_CLASS)
+ {
+ EXPECT_TRUE(bfm_lookup(&root, i, &val));
+ EXPECT_EQ_U32(val, -i);
+ }
+
+ while (root.rnode->count < 128)
+ {
+ EXPECT_FALSE(bfm_set(&root, cur, -cur));
+ cur += BFM_MAX_CLASS;
+ }
+
+ for (int i = 1025; i < cur; i += BFM_MAX_CLASS)
+ {
+ EXPECT_TRUE(bfm_lookup(&root, i, &val));
+ EXPECT_EQ_U32(val, -i);
+ }
+
+ while (root.rnode->count < BFM_MAX_CLASS)
+ {
+ EXPECT_FALSE(bfm_set(&root, cur, -cur));
+ cur += BFM_MAX_CLASS;
+ }
+
+ for (int i = 1025; i < cur; i += BFM_MAX_CLASS)
+ {
+ EXPECT_TRUE(bfm_lookup(&root, i, &val));
+ EXPECT_EQ_U32(val, -i);
+ }
+
+ while (root.rnode->count == BFM_MAX_CLASS)
+ {
+ EXPECT_FALSE(bfm_set(&root, cur, -cur));
+ cur += BFM_MAX_CLASS;
+ }
+
+ for (int i = 1025; i < cur; i += BFM_MAX_CLASS)
+ {
+ EXPECT_TRUE(bfm_lookup(&root, i, &val));
+ EXPECT_EQ_U32(val, -i);
+ }
+
+}
+
+static void
+bfm_test_delete_lots(void)
+{
+ bfm_tree root;
+ bfm_value_type val;
+ bfm_key_type insertval;
+
+ bfm_init(&root);
+
+ insertval = 0;
+ while (!root.rnode ||
+ root.rnode->node_shift != (BFM_FANOUT * 2))
+ {
+ EXPECT_FALSE(bfm_set(&root, insertval, -insertval));
+ insertval++;
+ }
+
+ for (bfm_key_type i = 0; i < insertval; i++)
+ {
+ EXPECT_TRUE(bfm_lookup(&root, i, &val));
+ EXPECT_EQ_U32(val, -i);
+ EXPECT_TRUE(bfm_delete(&root, i));
+ EXPECT_FALSE(bfm_lookup(&root, i, &val));
+ }
+
+ EXPECT_TRUE(root.rnode == NULL);
+}
+
+#include "portability/instr_time.h"
+
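+/*
+ * Simple micro-benchmark: ordered insertions, lookups and deletions of
+ * `count` keys, reporting throughput and the node statistics from
+ * bfm_stats().
+ */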
+static void
+bfm_test_insert_bulk(int count)
+{
+ bfm_tree root;
+ bfm_value_type val;
+ instr_time start, end, diff;
+ int misses;
+ int mult = 1;
+
+ bfm_init(&root);
+
+ INSTR_TIME_SET_CURRENT(start);
+
+ for (int i = 0; i < count; i++)
+ bfm_set(&root, i*mult, -i);
+
+ INSTR_TIME_SET_CURRENT(end);
+ INSTR_TIME_SET_ZERO(diff);
+ INSTR_TIME_ACCUM_DIFF(diff, end, start);
+
+ elog(NOTICE, "%d ordered insertions in %f seconds, %d/sec",
+ count,
+ INSTR_TIME_GET_DOUBLE(diff),
+ (int)(count/INSTR_TIME_GET_DOUBLE(diff)));
+
+ INSTR_TIME_SET_CURRENT(start);
+
+ misses = 0;
+ for (int i = 0; i < count; i++)
+ {
+ if (unlikely(!bfm_lookup(&root, i*mult, &val)))
+ misses++;
+ }
+ if (misses > 0)
+ elog(ERROR, "not present for lookup: %d entries", misses);
+
+ INSTR_TIME_SET_CURRENT(end);
+ INSTR_TIME_SET_ZERO(diff);
+ INSTR_TIME_ACCUM_DIFF(diff, end, start);
+
+ elog(NOTICE, "%d ordered lookups in %f seconds, %d/sec",
+ count,
+ INSTR_TIME_GET_DOUBLE(diff),
+ (int)(count/INSTR_TIME_GET_DOUBLE(diff)));
+
+ elog(LOG, "stats after lookup are: %s",
+ bfm_stats(&root)->data);
+
+ INSTR_TIME_SET_CURRENT(start);
+
+ misses = 0;
+ for (int i = 0; i < count; i++)
+ {
+ if (unlikely(!bfm_delete(&root, i*mult)))
+ misses++;
+ }
+ if (misses > 0)
+ elog(ERROR, "not present for deletion: %d entries", misses);
+
+ INSTR_TIME_SET_CURRENT(end);
+ INSTR_TIME_SET_ZERO(diff);
+ INSTR_TIME_ACCUM_DIFF(diff, end, start);
+
+ elog(NOTICE, "%d ordered deletions in %f seconds, %d/sec",
+ count,
+ INSTR_TIME_GET_DOUBLE(diff),
+ (int)(count/INSTR_TIME_GET_DOUBLE(diff)));
+
+ elog(LOG, "stats after deletion are: %s",
+ bfm_stats(&root)->data);
+}
+
+void
+bfm_tests(void)
+{
+ bfm_tree root;
+ bfm_value_type val;
+
+ /* initialize a tree starting with a large value */
+ bfm_init(&root);
+ EXPECT_FALSE(bfm_set(&root, 1024, 1));
+ EXPECT_TRUE(bfm_lookup(&root, 1024, &val));
+ EXPECT_EQ_U32(val, 1);
+ /* there should only be the key we inserted */
+#ifdef BFM_STATS
+ EXPECT_EQ_U32(root.leaf_nodes[0], 1);
+#endif
+
+ /* check that we can subsequently insert a small value */
+ EXPECT_FALSE(bfm_set(&root, 1, 2));
+ EXPECT_TRUE(bfm_lookup(&root, 1, &val));
+ EXPECT_EQ_U32(val, 2);
+ EXPECT_TRUE(bfm_lookup(&root, 1024, &val));
+ EXPECT_EQ_U32(val, 1);
+
+ /* check that a 0 key and 0 value are correctly recognized */
+ bfm_init(&root);
+ EXPECT_FALSE(bfm_lookup(&root, 0, &val));
+ EXPECT_FALSE(bfm_set(&root, 0, 17));
+ EXPECT_TRUE(bfm_lookup(&root, 0, &val));
+ EXPECT_EQ_U32(val, 17);
+
+ EXPECT_FALSE(bfm_lookup(&root, 2, &val));
+ EXPECT_FALSE(bfm_set(&root, 2, 0));
+ EXPECT_TRUE(bfm_lookup(&root, 2, &val));
+ EXPECT_EQ_U32(val, 0);
+
+ /* check that repeated insertion of the same key updates value */
+ bfm_init(&root);
+ EXPECT_FALSE(bfm_set(&root, 9, 12));
+ EXPECT_TRUE(bfm_lookup(&root, 9, &val));
+ EXPECT_EQ_U32(val, 12);
+ EXPECT_TRUE(bfm_set(&root, 9, 13));
+ EXPECT_TRUE(bfm_lookup(&root, 9, &val));
+ EXPECT_EQ_U32(val, 13);
+
+
+ /* initialize a tree starting with a leaf value */
+ bfm_init(&root);
+ EXPECT_FALSE(bfm_set(&root, 3, 1));
+ EXPECT_TRUE(bfm_lookup(&root, 3, &val));
+ EXPECT_EQ_U32(val, 1);
+ /* there should only be the key we inserted */
+#ifdef BFM_STATS
+ EXPECT_EQ_U32(root.leaf_nodes[0], 1);
+#endif
+ /* and no inner ones */
+#ifdef BFM_STATS
+ EXPECT_EQ_U32(root.inner_nodes[0], 0);
+#endif
+
+ EXPECT_FALSE(bfm_set(&root, 1717, 17));
+ EXPECT_TRUE(bfm_lookup(&root, 1717, &val));
+ EXPECT_EQ_U32(val, 17);
+
+ /* check that a root leaf node grows correctly */
+ bfm_init(&root);
+ bfm_test_insert_leaf_grow(&root);
+
+ /* check that a non-root leaf node grows correctly */
+ bfm_init(&root);
+ EXPECT_FALSE(bfm_set(&root, 1024, 1024));
+ bfm_test_insert_leaf_grow(&root);
+
+ /* check that an inner node grows correctly */
+ bfm_test_insert_inner_grow();
+
+
+ bfm_init(&root);
+ EXPECT_FALSE(bfm_set(&root, 1, 1));
+ EXPECT_TRUE(bfm_lookup(&root, 1, &val));
+
+ /* deletion from leaf node at root */
+ EXPECT_TRUE(bfm_delete(&root, 1));
+ EXPECT_FALSE(bfm_lookup(&root, 1, &val));
+
+ /* repeated deletion fails */
+ EXPECT_FALSE(bfm_delete(&root, 1));
+ EXPECT_TRUE(root.rnode == NULL);
+
+ /* one deletion doesn't disturb other values in leaf */
+ EXPECT_FALSE(bfm_set(&root, 1, 1));
+ EXPECT_FALSE(bfm_set(&root, 2, 2));
+ EXPECT_TRUE(bfm_delete(&root, 1));
+ EXPECT_FALSE(bfm_lookup(&root, 1, &val));
+ EXPECT_TRUE(bfm_lookup(&root, 2, &val));
+ EXPECT_EQ_U32(val, 2);
+
+ EXPECT_TRUE(bfm_delete(&root, 2));
+ EXPECT_FALSE(bfm_lookup(&root, 2, &val));
+ EXPECT_TRUE(root.rnode == NULL);
+
+ /* deletion from a leaf node succeeds */
+ EXPECT_FALSE(bfm_set(&root, 0xFFFF02, 0xFFFF02));
+ EXPECT_FALSE(bfm_set(&root, 1, 1));
+ EXPECT_FALSE(bfm_set(&root, 2, 2));
+
+ EXPECT_TRUE(bfm_delete(&root, 1));
+ EXPECT_TRUE(bfm_lookup(&root, 0xFFFF02, &val));
+ EXPECT_FALSE(bfm_lookup(&root, 1, &val));
+ EXPECT_TRUE(bfm_lookup(&root, 2, &val));
+
+ EXPECT_TRUE(bfm_delete(&root, 2));
+ EXPECT_TRUE(bfm_lookup(&root, 0xFFFF02, &val));
+ EXPECT_FALSE(bfm_lookup(&root, 1, &val));
+
+ EXPECT_TRUE(bfm_delete(&root, 0xFFFF02));
+ EXPECT_FALSE(bfm_delete(&root, 0xFFFF02));
+ EXPECT_FALSE(bfm_lookup(&root, 0xFFFF02, &val));
+ EXPECT_TRUE(root.rnode == NULL);
+
+ /* check that repeatedly inserting and deleting the same value works */
+ bfm_init(&root);
+ EXPECT_FALSE(bfm_set(&root, 0x10000, -0x10000));
+ EXPECT_FALSE(bfm_set(&root, 0, 0));
+ EXPECT_TRUE(bfm_lookup(&root, 0, &val));
+ EXPECT_TRUE(bfm_delete(&root, 0));
+ EXPECT_FALSE(bfm_lookup(&root, 0, &val));
+ EXPECT_FALSE(bfm_set(&root, 0, 0));
+ EXPECT_TRUE(bfm_set(&root, 0, 0));
+ EXPECT_TRUE(bfm_lookup(&root, 0, &val));
+
+ bfm_test_delete_lots();
+
+ if (0)
+ {
+ int cnt = 300;
+
+ bfm_init(&root);
+ MemoryContextStats(root.context);
+ for (int i = 0; i < cnt; i++)
+ EXPECT_FALSE(bfm_set(&root, i, i));
+ MemoryContextStats(root.context);
+ for (int i = 0; i < cnt; i++)
+ EXPECT_TRUE(bfm_delete(&root, i));
+ MemoryContextStats(root.context);
+ }
+
+ if (1)
+ {
+ //bfm_test_insert_bulk( 100 * 1000);
+ //bfm_test_insert_bulk( 1000 * 1000);
+#ifdef USE_ASSERT_CHECKING
+ bfm_test_insert_bulk( 1 * 1000 * 1000);
+#endif
+ //bfm_test_insert_bulk( 10 * 1000 * 1000);
+#ifndef USE_ASSERT_CHECKING
+ bfm_test_insert_bulk( 100 * 1000 * 1000);
+#endif
+ //bfm_test_insert_bulk(1000 * 1000 * 1000);
+ }
+
+ //bfm_print(&root);
+}
diff --git a/bdbench/radix.h b/bdbench/radix.h
new file mode 100644
index 0000000..c908aa5
--- /dev/null
+++ b/bdbench/radix.h
@@ -0,0 +1,76 @@
+/*-------------------------------------------------------------------------
+ *
+ * radix.h
+ * radix tree, yay.
+ *
+ *
+ * Portions Copyright (c) 2014-2021, PostgreSQL Global Development Group
+ *
+ * src/include/storage/radix.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef RADIX_H
+#define RADIX_H
+
+typedef uint64 bfm_key_type;
+typedef uint64 bfm_value_type;
+//typedef uint32 bfm_value_type;
+//typedef char bfm_value_type;
+
+/* How many different size classes are there */
+#define BFM_KIND_COUNT 6
+
+typedef enum bfm_tree_node_kind
+{
+ BFM_KIND_1,
+ BFM_KIND_4,
+ BFM_KIND_16,
+ BFM_KIND_32,
+ BFM_KIND_128,
+ BFM_KIND_MAX
+} bfm_tree_node_kind;
+
+struct MemoryContextData;
+struct bfm_tree_node;
+
+/* NB: makes things a bit slower */
+#define BFM_STATS
+
+#define BFM_USE_SLAB
+//#define BFM_USE_OS
+
+/*
+ * A radix tree with nodes that are sized based on occupancy.
+ */
+typedef struct bfm_tree
+{
+ struct bfm_tree_node *rnode;
+ uint64 maxval;
+
+ struct MemoryContextData *context;
+#ifdef BFM_USE_SLAB
+ struct MemoryContextData *inner_slabs[BFM_KIND_COUNT];
+ struct MemoryContextData *leaf_slabs[BFM_KIND_COUNT];
+#endif
+
+#ifdef BFM_STATS
+ /* stats */
+ size_t entries;
+ size_t inner_nodes[BFM_KIND_COUNT];
+ size_t leaf_nodes[BFM_KIND_COUNT];
+#endif
+} bfm_tree;
+
+extern void bfm_init(bfm_tree *root);
+extern bool bfm_lookup(bfm_tree *root, bfm_key_type key, bfm_value_type *val);
+extern bool bfm_set(bfm_tree *root, bfm_key_type key, bfm_value_type val);
+extern bool bfm_delete(bfm_tree *root, bfm_key_type key);
+
+extern struct StringInfoData* bfm_stats(bfm_tree *root);
+extern void bfm_print(bfm_tree *root);
+
+extern void bfm_tests(void);
+
+#endif
--
2.32.0.rc2
Attachment: 0003-Add-radix-tree-benchmark-integration.patch (text/x-diff)
From 131074dcbe72ff8af00cb879c7c92747dc100e69 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 19 Jul 2021 16:05:44 -0700
Subject: [PATCH 3/3] Add radix tree benchmark integration.
---
bdbench/Makefile | 2 +-
bdbench/bdbench--1.0.sql | 4 +
bdbench/bdbench.c | 181 ++++++++++++++++++++++++++++++++++++++-
bdbench/bench.sql | 6 +-
4 files changed, 189 insertions(+), 4 deletions(-)
diff --git a/bdbench/Makefile b/bdbench/Makefile
index 6d52940..723132a 100644
--- a/bdbench/Makefile
+++ b/bdbench/Makefile
@@ -2,7 +2,7 @@
MODULE_big = bdbench
DATA = bdbench--1.0.sql
-OBJS = bdbench.o vtbm.o rtbm.o
+OBJS = bdbench.o vtbm.o rtbm.o radix.o
EXTENSION = bdbench
REGRESS= bdbench
diff --git a/bdbench/bdbench--1.0.sql b/bdbench/bdbench--1.0.sql
index 933cf71..bd59293 100644
--- a/bdbench/bdbench--1.0.sql
+++ b/bdbench/bdbench--1.0.sql
@@ -109,3 +109,7 @@ RETURNS text
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT VOLATILE;
+CREATE FUNCTION radix_run_tests()
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE;
diff --git a/bdbench/bdbench.c b/bdbench/bdbench.c
index 1df5c53..85d8eaa 100644
--- a/bdbench/bdbench.c
+++ b/bdbench/bdbench.c
@@ -19,6 +19,7 @@
#include "vtbm.h"
#include "rtbm.h"
+#include "radix.h"
//#define DEBUG_DUMP_MATCHED 1
@@ -89,6 +90,7 @@ PG_FUNCTION_INFO_V1(attach_dead_tuples);
PG_FUNCTION_INFO_V1(bench);
PG_FUNCTION_INFO_V1(test_generate_tid);
PG_FUNCTION_INFO_V1(rtbm_test);
+PG_FUNCTION_INFO_V1(radix_run_tests);
PG_FUNCTION_INFO_V1(prepare);
/*
@@ -137,6 +139,16 @@ static void rtbm_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
static bool rtbm_reaped(LVTestType *lvtt, ItemPointer itemptr);
static Size rtbm_mem_usage(LVTestType *lvtt);
+/* radix */
+static void radix_init(LVTestType *lvtt, uint64 nitems);
+static void radix_fini(LVTestType *lvtt);
+static void radix_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
+ BlockNumber maxblk, OffsetNumber maxoff);
+static bool radix_reaped(LVTestType *lvtt, ItemPointer itemptr);
+static Size radix_mem_usage(LVTestType *lvtt);
+static void radix_load(void *tbm, ItemPointerData *itemptrs, int nitems);
+
+
/* Misc functions */
static void generate_index_tuples(uint64 nitems, BlockNumber minblk,
BlockNumber maxblk, OffsetNumber maxoff);
@@ -156,12 +168,13 @@ static void load_rtbm(RTbm *vtbm, ItemPointerData *itemptrs, int nitems);
.dtinfo = {0}, \
.name = #n, \
.init_fn = n##_init, \
+ .fini_fn = n##_fini, \
.attach_fn = n##_attach, \
.reaped_fn = n##_reaped, \
.mem_usage_fn = n##_mem_usage, \
}
-#define TEST_SUBJECT_TYPES 5
+#define TEST_SUBJECT_TYPES 6
static LVTestType LVTestSubjects[TEST_SUBJECT_TYPES] =
{
DECLARE_SUBJECT(array),
@@ -169,6 +182,7 @@ static LVTestType LVTestSubjects[TEST_SUBJECT_TYPES] =
DECLARE_SUBJECT(intset),
DECLARE_SUBJECT(vtbm),
DECLARE_SUBJECT(rtbm),
+ DECLARE_SUBJECT(radix)
};
static bool
@@ -192,6 +206,31 @@ update_info(DeadTupleInfo *info, uint64 nitems, BlockNumber minblk,
info->maxoff = maxoff;
}
+
+/* from geqo's init_tour(), geqo_randint() */
+static int
+shuffle_randrange(unsigned short xseed[3], int lower, int upper)
+{
+ return (int) floor( pg_erand48(xseed) * ((upper-lower)+0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation*/
+static void
+shuffle_itemptrs(uint64 nitems, ItemPointer itemptrs)
+{
+ /* reproducability */
+ unsigned short xseed[3] = {0};
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(xseed, i, nitems - 1);
+ ItemPointerData t = itemptrs[j];
+
+ itemptrs[j] = itemptrs[i];
+ itemptrs[i] = t;
+ }
+}
+
static void
generate_index_tuples(uint64 nitems, BlockNumber minblk, BlockNumber maxblk,
OffsetNumber maxoff)
@@ -586,6 +625,138 @@ load_rtbm(RTbm *rtbm, ItemPointerData *itemptrs, int nitems)
rtbm_add_tuples(rtbm, curblkno, offs, noffs);
}
+/* ---------- radix ---------- */
+static void
+radix_init(LVTestType *lvtt, uint64 nitems)
+{
+ MemoryContext old_ctx;
+
+ lvtt->mcxt = AllocSetContextCreate(TopMemoryContext,
+ "radix bench",
+ ALLOCSET_DEFAULT_SIZES);
+ old_ctx = MemoryContextSwitchTo(lvtt->mcxt);
+ lvtt->private = palloc(sizeof(bfm_tree));
+ bfm_init(lvtt->private);
+ MemoryContextSwitchTo(old_ctx);
+}
+static void
+radix_fini(LVTestType *lvtt)
+{
+#if 0
+ if (lvtt->private)
+ bfm_free((RTbm *) lvtt->private);
+#endif
+}
+
+/* log(sizeof(bfm_value_type) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+#define ENCODE_BITS 6
+
+static uint64
+radix_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= ((uint64) ItemPointerGetBlockNumber(tid)) << shift;
+
+ *off = tid_i & ((1 << ENCODE_BITS)-1);
+ upper = tid_i >> ENCODE_BITS;
+ Assert(*off < (sizeof(bfm_value_type) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static void
+radix_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
+ BlockNumber maxblk, OffsetNumber maxoff)
+{
+ MemoryContext oldcontext = MemoryContextSwitchTo(lvtt->mcxt);
+
+ radix_load(lvtt->private,
+ DeadTuples_orig->itemptrs,
+ DeadTuples_orig->dtinfo.nitems);
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
+
+static bool
+radix_reaped(LVTestType *lvtt, ItemPointer itemptr)
+{
+ uint64 key;
+ uint32 off;
+ bfm_value_type val;
+
+ key = radix_to_key_off(itemptr, &off);
+
+ if (!bfm_lookup((bfm_tree *) lvtt->private, key, &val))
+ return false;
+
+ return val & ((bfm_value_type)1 << off);
+}
+
+static uint64
+radix_mem_usage(LVTestType *lvtt)
+{
+ bfm_tree *root = lvtt->private;
+ size_t mem = MemoryContextMemAllocated(lvtt->mcxt, true);
+ StringInfo s;
+
+ s = bfm_stats(root);
+
+ ereport(NOTICE,
+ errmsg("radix tree of %.2f MB, %s",
+ (double) mem / (1024 * 1024),
+ s->data),
+ errhidestmt(true),
+ errhidecontext(true));
+
+ pfree(s->data);
+ pfree(s);
+
+ return mem;
+}
+
+static void
+radix_load(void *tbm, ItemPointerData *itemptrs, int nitems)
+{
+ bfm_tree *root = (bfm_tree *) tbm;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemPointer tid = &(itemptrs[i]);
+ uint64 key;
+ uint32 off;
+
+ key = radix_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX &&
+ last_key != key)
+ {
+ bfm_set(root, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64)1 << off;
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ bfm_set(root, last_key, val);
+ }
+}
+
+
static void
attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk, BlockNumber maxblk,
OffsetNumber maxoff)
@@ -952,3 +1123,11 @@ rtbm_test(PG_FUNCTION_ARGS)
PG_RETURN_NULL();
}
+
+Datum
+radix_run_tests(PG_FUNCTION_ARGS)
+{
+ bfm_tests();
+
+ PG_RETURN_VOID();
+}
diff --git a/bdbench/bench.sql b/bdbench/bench.sql
index c5ef2d3..94cfde0 100644
--- a/bdbench/bench.sql
+++ b/bdbench/bench.sql
@@ -11,10 +11,11 @@ select prepare(
-- Load dead tuples to all data structures.
select 'array', attach_dead_tuples('array');
-select 'tbm', attach_dead_tuples('tbm');
select 'intset', attach_dead_tuples('intset');
-select 'vtbm', attach_dead_tuples('vtbm');
select 'rtbm', attach_dead_tuples('rtbm');
+select 'tbm', attach_dead_tuples('tbm');
+select 'vtbm', attach_dead_tuples('vtbm');
+select 'radix', attach_dead_tuples('radix');
-- Do benchmark of lazy_tid_reaped.
select 'array bench', bench('array');
@@ -22,6 +23,7 @@ select 'intset bench', bench('intset');
select 'rtbm bench', bench('rtbm');
select 'tbm bench', bench('tbm');
select 'vtbm bench', bench('vtbm');
+select 'radix', bench('radix');
-- Check the memory usage.
select * from pg_backend_memory_contexts where name ~ 'bench' or name = 'TopMemoryContext' order by name;
--
2.32.0.rc2
Hi,
On 2021-07-19 16:49:15 -0700, Andres Freund wrote:
E.g. for
select prepare(
1000000, -- max block
20, -- # of dead tuples per page
10, -- dead tuples interval within a page
1 -- page interval
);
attach size shuffled ordered
array 69 ms 120 MB 84.87 s 8.66 s
intset 173 ms 65 MB 68.82 s 11.75 s
rtbm 201 ms 67 MB 11.54 s 1.35 s
tbm 232 ms 100 MB 8.33 s 1.26 s
vtbm 162 ms 58 MB 10.01 s 1.22 s
radix 88 ms 42 MB 11.49 s 1.67 s
and for
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1 -- page interval
);
attach size shuffled ordered
array 24 ms 60MB 3.74s 1.02 s
intset 97 ms 49MB 3.14s 0.75 s
rtbm 138 ms 36MB 0.41s 0.14 s
tbm 198 ms 101MB 0.41s 0.14 s
vtbm 118 ms 27MB 0.39s 0.12 s
radix 33 ms 10MB 0.28s 0.10 s
Oh, I forgot: The performance numbers are with the fixes in
/messages/by-id/20210717194333.mr5io3zup3kxahfm@alap3.anarazel.de
applied.
Greetings,
Andres Freund
Hi,
I've dreamed of writing a more compact structure for vacuum for three
years, but life didn't give me the time.
Let me join the friendly competition.
I've bet on the HAMT approach: popcount-ing bitmaps for non-empty elements.
Novelties:
- 32 consecutive pages are stored together in a single sparse array
(called "chunks").
Chunk contains:
- its number,
- 4 byte bitmap of non-empty pages,
- array of non-empty page headers 2 byte each.
Page header contains offset of page's bitmap in bitmaps container.
(Except if there is just one dead tuple in a page. Then it is
written into header itself).
- container of concatenated bitmaps.
I.e., page metadata overhead varies from 2.4 bytes (32 pages in a single
chunk) to 18 bytes (1 page in a single chunk) per page.
- If a page's bitmap is sparse, i.e. contains a lot of "all-zero" bytes,
it is compressed by removing the zero bytes and indexing with a two-level
bitmap index.
Two-level index - zero bytes in first level are removed using
second level. It is mostly done for 32kb pages, but let it stay since
it is almost free.
- If a page's bitmap contains a lot of "all-one" bytes, it is inverted
and then encoded as sparse.
- Chunks are allocated with custom "allocator" that has no
per-allocation overhead. It is possible because there is no need
to perform "free": allocator is freed as whole at once.
- The array of pointers to chunks is also bitmap indexed. It saves CPU time
when not every run of 32 consecutive pages has at least one dead tuple,
but consumes time otherwise. Therefore an additional optimization is added
to quickly skip the lookup for the first non-empty run of chunks.
(Ahhh, I believe this explanation is awful).
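(To make the popcount indexing concrete, a minimal illustrative sketch follows;
the type and field names are hypothetical stand-ins, not the structures from
the attached patch:)

#include "postgres.h"
#include "port/pg_bitutils.h"

/* Hypothetical, simplified chunk: up to 32 pages, headers stored sparsely. */
typedef struct ChunkSketch
{
	uint32		present;		/* bit i set => page i of this chunk is non-empty */
	uint16		headers[32];	/* headers only for pages that are actually present */
} ChunkSketch;

/*
 * Fetch the header for page_in_chunk (0..31), or return false if that page
 * has no dead tuples.  The index into headers[] is simply the number of
 * present pages before this one: popcount of the lower bits of the bitmap.
 */
static bool
chunk_get_header(const ChunkSketch *chunk, int page_in_chunk, uint16 *header)
{
	uint32		bit = (uint32) 1 << page_in_chunk;

	if ((chunk->present & bit) == 0)
		return false;

	*header = chunk->headers[pg_popcount32(chunk->present & (bit - 1))];
	return true;
}

(The patch additionally defines its own small byte-popcount table; see svtm.c below.)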
Andres Freund wrote 2021-07-20 02:49:
Hi,
On 2021-07-19 15:20:54 +0900, Masahiko Sawada wrote:
BTW is the implementation of the radix tree approach available
somewhere? If so I'd like to experiment with that too.
I have toyed with implementing adaptively large radix nodes like
proposed in https://db.in.tum.de/~leis/papers/ART.pdf - but haven't
gotten it quite working.
That seems promising approach.
I've since implemented some, but not all of the ideas of that paper
(adaptive node sizes, but not the tree compression pieces).
E.g. for
select prepare(
1000000, -- max block
20, -- # of dead tuples per page
10, -- dead tuples interval within a page
1 -- page interval
);
attach size shuffled ordered
array 69 ms 120 MB 84.87 s 8.66 s
intset 173 ms 65 MB 68.82 s 11.75 s
rtbm 201 ms 67 MB 11.54 s 1.35 s
tbm 232 ms 100 MB 8.33 s 1.26 s
vtbm 162 ms 58 MB 10.01 s 1.22 s
radix 88 ms 42 MB 11.49 s 1.67 s
and for
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1 -- page interval
);
attach size shuffled ordered
array 24 ms 60MB 3.74s 1.02 s
intset 97 ms 49MB 3.14s 0.75 s
rtbm 138 ms 36MB 0.41s 0.14 s
tbm 198 ms 101MB 0.41s 0.14 s
vtbm 118 ms 27MB 0.39s 0.12 s
radix 33 ms 10MB 0.28s 0.10 s
(this is an almost unfairly good case for radix)
Running out of time to format the results of the other testcases before
I have to run, unfortunately. radix uses 42MB both in test case 3 and
4.
My results (Ubuntu 20.04 Intel Core i7-1165G7):
Test1.
select prepare(1000000, 10, 20, 1); -- original
attach size shuffled
array 29ms 60MB 93.99s
intset 93ms 49MB 80.94s
rtbm 171ms 67MB 14.05s
tbm 238ms 100MB 8.36s
vtbm 148ms 59MB 9.12s
radix 100ms 42MB 11.81s
svtm 75ms 29MB 8.90s
select prepare(1000000, 20, 10, 1); -- Andres's variant
attach size shuffled
array 61ms 120MB 111.91s
intset 163ms 66MB 85.00s
rtbm 236ms 67MB 10.72s
tbm 290ms 100MB 8.40s
vtbm 190ms 59MB 9.28s
radix 117ms 42MB 12.00s
svtm 98ms 29MB 8.77s
Test2.
select prepare(1000000, 10, 1, 1);
attach size shuffled
array 31ms 60MB 4.68s
intset 97ms 49MB 4.03s
rtbm 163ms 36MB 0.42s
tbm 240ms 100MB 0.42s
vtbm 136ms 27MB 0.36s
radix 60ms 10MB 0.72s
svtm 39ms 6MB 0.19s
(Bad radix result probably due to smaller cache in notebook's CPU ?)
Test3
select prepare(1000000, 2, 100, 1);
attach size shuffled
array 6ms 12MB 53.42s
intset 23ms 16MB 54.99s
rtbm 115ms 38MB 8.19s
tbm 186ms 100MB 8.37s
vtbm 105ms 59MB 9.08s
radix 64ms 42MB 10.41s
svtm 73ms 10MB 7.49s
Test4
select prepare(1000000, 100, 1, 1);
attach size shuffled
array 304ms 600MB 75.12s
intset 775ms 98MB 47.49s
rtbm 356ms 38MB 4.11s
tbm 539ms 100MB 4.20s
vtbm 493ms 42MB 4.44s
radix 263ms 42MB 6.05s
svtm 360ms 8MB 3.49s
Therefore the Specialized Vacuum Tid Map always consumes the least memory
and is usually faster.
(I've applied Andres's patch for slab allocator before testing)
The attached patch is against commit 6753911a444e12e4b55 of your pgtools,
with Andres's patches for the radix method applied.
I've also pushed it to github:
https://github.com/funny-falcon/pgtools/tree/svtm/bdbench
regards,
Yura Sokolov
Attachments:
0001-svtm-specialized-vacuum-tid-map.patch (text/x-diff)
From 3a6c96cc705b1af412cf9300be6f676f6c5e4aa6 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <funny.falcon@gmail.com>
Date: Sun, 25 Jul 2021 03:06:48 +0300
Subject: [PATCH] svtm - specialized vacuum tid map
---
bdbench/Makefile | 2 +-
bdbench/bdbench.c | 91 ++++++-
bdbench/bench.sql | 2 +
bdbench/svtm.c | 635 ++++++++++++++++++++++++++++++++++++++++++++++
bdbench/svtm.h | 19 ++
5 files changed, 746 insertions(+), 3 deletions(-)
create mode 100644 bdbench/svtm.c
create mode 100644 bdbench/svtm.h
diff --git a/bdbench/Makefile b/bdbench/Makefile
index 723132a..a6f758f 100644
--- a/bdbench/Makefile
+++ b/bdbench/Makefile
@@ -2,7 +2,7 @@
MODULE_big = bdbench
DATA = bdbench--1.0.sql
-OBJS = bdbench.o vtbm.o rtbm.o radix.o
+OBJS = bdbench.o vtbm.o rtbm.o radix.o svtm.o
EXTENSION = bdbench
REGRESS= bdbench
diff --git a/bdbench/bdbench.c b/bdbench/bdbench.c
index 85d8eaa..a8bc49a 100644
--- a/bdbench/bdbench.c
+++ b/bdbench/bdbench.c
@@ -7,6 +7,7 @@
#include "postgres.h"
+#include <math.h>
#include "catalog/index.h"
#include "fmgr.h"
#include "funcapi.h"
@@ -20,6 +21,7 @@
#include "vtbm.h"
#include "rtbm.h"
#include "radix.h"
+#include "svtm.h"
//#define DEBUG_DUMP_MATCHED 1
@@ -148,6 +150,15 @@ static bool radix_reaped(LVTestType *lvtt, ItemPointer itemptr);
static Size radix_mem_usage(LVTestType *lvtt);
static void radix_load(void *tbm, ItemPointerData *itemptrs, int nitems);
+/* svtm */
+static void svtm_init(LVTestType *lvtt, uint64 nitems);
+static void svtm_fini(LVTestType *lvtt);
+static void svtm_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
+ BlockNumber maxblk, OffsetNumber maxoff);
+static bool svtm_reaped(LVTestType *lvtt, ItemPointer itemptr);
+static Size svtm_mem_usage(LVTestType *lvtt);
+static void svtm_load(SVTm *tbm, ItemPointerData *itemptrs, int nitems);
+
/* Misc functions */
static void generate_index_tuples(uint64 nitems, BlockNumber minblk,
@@ -174,7 +185,7 @@ static void load_rtbm(RTbm *vtbm, ItemPointerData *itemptrs, int nitems);
.mem_usage_fn = n##_mem_usage, \
}
-#define TEST_SUBJECT_TYPES 6
+#define TEST_SUBJECT_TYPES 7
static LVTestType LVTestSubjects[TEST_SUBJECT_TYPES] =
{
DECLARE_SUBJECT(array),
@@ -182,7 +193,8 @@ static LVTestType LVTestSubjects[TEST_SUBJECT_TYPES] =
DECLARE_SUBJECT(intset),
DECLARE_SUBJECT(vtbm),
DECLARE_SUBJECT(rtbm),
- DECLARE_SUBJECT(radix)
+ DECLARE_SUBJECT(radix),
+ DECLARE_SUBJECT(svtm)
};
static bool
@@ -756,6 +768,81 @@ radix_load(void *tbm, ItemPointerData *itemptrs, int nitems)
}
}
+/* ------------ svtm ----------- */
+static void
+svtm_init(LVTestType *lvtt, uint64 nitems)
+{
+ MemoryContext old_ctx;
+
+ lvtt->mcxt = AllocSetContextCreate(TopMemoryContext,
+ "svtm bench",
+ ALLOCSET_DEFAULT_SIZES);
+ old_ctx = MemoryContextSwitchTo(lvtt->mcxt);
+ lvtt->private = svtm_create();
+ MemoryContextSwitchTo(old_ctx);
+}
+
+static void
+svtm_fini(LVTestType *lvtt)
+{
+ if (lvtt->private != NULL)
+ svtm_free(lvtt->private);
+}
+
+static void
+svtm_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
+ BlockNumber maxblk, OffsetNumber maxoff)
+{
+ MemoryContext oldcontext = MemoryContextSwitchTo(lvtt->mcxt);
+
+ svtm_load(lvtt->private,
+ DeadTuples_orig->itemptrs,
+ DeadTuples_orig->dtinfo.nitems);
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
+static bool
+svtm_reaped(LVTestType *lvtt, ItemPointer itemptr)
+{
+ return svtm_lookup(lvtt->private, itemptr);
+}
+
+static uint64
+svtm_mem_usage(LVTestType *lvtt)
+{
+ svtm_stats((SVTm *) lvtt->private);
+ return MemoryContextMemAllocated(lvtt->mcxt, true);
+}
+
+static void
+svtm_load(SVTm *svtm, ItemPointerData *itemptrs, int nitems)
+{
+ BlockNumber curblkno = InvalidBlockNumber;
+ OffsetNumber offs[1024];
+ int noffs = 0;
+
+ for (int i = 0; i < nitems; i++)
+ {
+ ItemPointer tid = &(itemptrs[i]);
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+
+ if (curblkno != InvalidBlockNumber &&
+ curblkno != blkno)
+ {
+ svtm_add_page(svtm, curblkno, offs, noffs);
+ curblkno = blkno;
+ noffs = 0;
+ }
+
+ curblkno = blkno;
+ offs[noffs++] = ItemPointerGetOffsetNumber(tid);
+ }
+
+ svtm_add_page(svtm, curblkno, offs, noffs);
+ svtm_finalize_addition(svtm);
+}
+
static void
attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk, BlockNumber maxblk,
diff --git a/bdbench/bench.sql b/bdbench/bench.sql
index 94cfde0..b303591 100644
--- a/bdbench/bench.sql
+++ b/bdbench/bench.sql
@@ -16,6 +16,7 @@ select 'rtbm', attach_dead_tuples('rtbm');
select 'tbm', attach_dead_tuples('tbm');
select 'vtbm', attach_dead_tuples('vtbm');
select 'radix', attach_dead_tuples('radix');
+select 'svtm', attach_dead_tuples('svtm');
-- Do benchmark of lazy_tid_reaped.
select 'array bench', bench('array');
@@ -24,6 +25,7 @@ select 'rtbm bench', bench('rtbm');
select 'tbm bench', bench('tbm');
select 'vtbm bench', bench('vtbm');
select 'radix', bench('radix');
+select 'svtm', bench('svtm');
-- Check the memory usage.
select * from pg_backend_memory_contexts where name ~ 'bench' or name = 'TopMemoryContext' order by name;
diff --git a/bdbench/svtm.c b/bdbench/svtm.c
new file mode 100644
index 0000000..6ce4ed9
--- /dev/null
+++ b/bdbench/svtm.c
@@ -0,0 +1,635 @@
+/*------------------------------------------------------------------------------
+ *
+ * svtm.c - Specialized Vacuum TID Map
+ * Data structure to hold TIDs of dead tuples during vacuum.
+ *
+ * It takes into account the following properties of PostgreSQL ItemPointers
+ * and the vacuum heap scan process:
+ * - the page (block) number is a 32 bit integer,
+ * - 14 bits are enough for a tuple offset,
+ * - but usually the number of tuples per page is significantly smaller,
+ * - and 0 is InvalidOffsetNumber,
+ * - the heap is scanned sequentially, therefore pages arrive in increasing order,
+ * - all tuples of a single page can be added at once.
+ *
+ * It uses techniques from HAMT (Hash Array Mapped Trie) and Roaring bitmaps.
+ *
+ * # Page.
+ *
+ * Page information consists of 16 bit page header and bitmap or sparse bitmap
+ * container. Header and bitmap contains different information
+ * depending on high bits of header.
+ *
+ * A sparse bitmap is made from the raw bitmap by skipping all-zero bytes. The
+ * non-zero bytes are then indexed with a sparseness bitmap.
+ *
+ * If the bitmap contains a lot of all-one bytes, it is inverted before
+ * being made sparse.
+ *
+ * Kinds of header/bitmap:
+ * - embedded 1 offset
+ * high bits: 11
+ * lower bits: 14bit tuple offset
+ * bitmap: no external bitmap
+ *
+ * - raw bitmap
+ * high bits: 00
+ * lower bits: 14bit offset in bitmap container
+ * bitmap: 1 byte bitmap length = K
+ * K byte raw bitmap
+ * This container is used if there is no detectable pattern in offsets.
+ *
+ * - sparse bitmap
+ * high bits: 10
+ * lower bits: 14bit offset in bitmap container
+ * bitmap: 1 byte raw bitmap length = K
+ * 1 byte sparseness bitmap length = S
+ * S bytes sparseness bitmap
+ * Z bytes of non-zero bitmap bytes
+ * If raw bitmap contains > 62.5% of zero bytes, then sparse bitmap format is
+ * chosen.
+ *
+ * - inverted sparse bitmap
+ * high bits: 10
+ * lower bits: 14bit offset in bitmap container
+ * bitmap: 1 byte raw bitmap length = K
+ * 1 byte sparseness bitmap length = S
+ * S bytes sparseness bitmap
+ * Z bytes of non-zero inverted bitmap bytes
+ * If raw bitmap contains > 62.5% of all-ones bytes, then sparse bitmap format
+ * is used to encode whenever tuple is not dead instead.
+ *
+ * # Page map chunk.
+ *
+ * 32 consecutive page headers are stored in a sparse array together with
+ * their bitmaps. Pages without any dead tuple are skipped from this array.
+ *
+ * Therefore chunk map contains:
+ * - 32bitmap of pages presence
+ * - array of 0-32 page headers
+ * - byte array of concatenated bitmaps for all pages in a chunk (with offsets
+ * encoded in page headers).
+ *
+ * Maximum chunk size:
+ * - page header map: 4 + 32*2 = 68 bytes
+ * - bitmaps byte array:
+ * 32kb page: 32 * 148 = 4736 byte
+ * 8kb page: 32 * 36 = 1152 byte
+ * - sum:
+ * 32kb page: 4804 bytes
+ * 8kb page: 1220 bytes
+ *
+ * Each chunk is allocated as a single blob.
+ *
+ * # Page chunk map.
+ *
+ * Pointers to chunks are stored into sparse array indexed with ixmap bitmap.
+ * Number of first non-empty chunk and first empty chunk after it are
+ * remembered to reduce size of bitmap and speedup access to first run
+ * of non-empty chunks.
+ */
+
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "lib/stringinfo.h"
+#include "port/pg_bitutils.h"
+
+#include "svtm.h"
+
+#define PAGES_PER_CHUNK (1<<5)
+#define BITMAP_PER_PAGE (MaxHeapTuplesPerPage/8 + 1)
+#define PAGE_TO_CHUNK(blkno) ((uint32)(blkno)>>5)
+#define CHUNK_TO_PAGE(chunkno) ((chunkno)<<5)
+
+#define SVTAllocChunk ((1<<19)-128)
+
+typedef struct SVTPagesChunk SVTPagesChunk;
+typedef struct SVTChunkBuilder SVTChunkBuilder;
+typedef struct SVTAlloc SVTAlloc;
+typedef struct IxMap IxMap;
+typedef uint16 SVTHeader;
+
+struct SVTAlloc {
+ SVTAlloc* next;
+ Size pos;
+ Size limit;
+ uint8 bytes[FLEXIBLE_ARRAY_MEMBER];
+};
+
+struct SVTChunkBuilder
+{
+ uint32 chunk_number;
+ uint32 npages;
+ uint32 bitmaps_pos;
+ uint32 hcnt[4];
+ BlockNumber pages[PAGES_PER_CHUNK];
+ SVTHeader headers[PAGES_PER_CHUNK];
+ /* we add 3 for BITMAP_PER_PAGE for 4 byte roundup */
+ uint8 bitmaps[(BITMAP_PER_PAGE+3)*PAGES_PER_CHUNK];
+};
+
+struct IxMap {
+ uint32 bitmap;
+ uint32 offset;
+};
+
+struct SVTm
+{
+ BlockNumber lastblock; /* max block number + 1 */
+ struct {
+ uint32 start, end;
+ } firstrun;
+ uint32 nchunks;
+ SVTPagesChunk **chunks; /* chunks pointers */
+ IxMap *ixmap; /* compression map for chunks */
+ Size total_size;
+ SVTAlloc *alloc;
+
+ uint32 npages;
+ uint32 hcnt[4];
+
+ SVTChunkBuilder builder; /* builder for current chunk */
+};
+
+struct SVTPagesChunk
+{
+ uint32 chunk_number;
+ uint32 bitmap;
+ SVTHeader headers[FLEXIBLE_ARRAY_MEMBER];
+};
+
+#define bm2(b,c) (((b)<<1)|(c))
+enum SVTHeaderType {
+ SVTH_rawBitmap = bm2(0,0),
+ SVTH_inverseBitmap = bm2(0,1),
+ SVTH_sparseBitmap = bm2(1,0),
+ SVTH_single = bm2(1,1),
+};
+#define HeaderTypeOffset (14)
+#define MakeHeaderType(l) ((SVTHeader)(l) << HeaderTypeOffset)
+#define HeaderType(h) (((h)>>14)&3)
+
+#define BitmapPosition(h) ((h) & ((1<<14)-1))
+#define MakeBitmapPosition(l) ((l) & ((1<<14)-1))
+#define MaxBitmapPosition ((1<<14)-1)
+
+#define SingleItem(h) ((h) & ((1<<14)-1))
+#define MakeSingleItem(h) ((h) & ((1<<14)-1))
+
+/*
+ * we could not use pg_popcount32 in contrib in windows,
+ * therefore define our own.
+ */
+#define INVALID_INDEX (~(uint32)0)
+const uint8 four_bit_cnt[32] = {
+ 0, 1, 1, 2, 1, 2, 2, 3,
+ 1, 2, 2, 3, 2, 3, 3, 4,
+ 1, 2, 2, 3, 2, 3, 3, 4,
+ 2, 3, 3, 4, 3, 4, 4, 5,
+};
+
+#define makeoff(v, bits) ((v)/bits)
+#define makebit(v, bits) (1<<((v)&((bits)-1)))
+#define maskbits(v, bits) ((v) & ((1<<(bits))-1))
+#define bitszero(v, bits) (maskbits((v), (bits)) == 0)
+
+static inline uint32 svt_popcnt32(uint32 val);
+static void svtm_build_chunk(SVTm *store);
+
+static inline uint32
+svt_popcnt8(uint8 val)
+{
+ return four_bit_cnt[val&15] + four_bit_cnt[(val>>4)&15];
+}
+
+static inline uint32
+svt_popcnt32(uint32 val)
+{
+ return pg_popcount32(val);
+}
+
+static SVTAlloc*
+svtm_alloc_alloc(void)
+{
+ SVTAlloc *alloc = palloc0(SVTAllocChunk);
+ alloc->limit = SVTAllocChunk - offsetof(SVTAlloc, bytes);
+ return alloc;
+}
+
+SVTm*
+svtm_create(void)
+{
+ SVTm* store = palloc0(sizeof(SVTm));
+ /* preallocate chunks just to pass it to repalloc later */
+ store->chunks = palloc(sizeof(SVTPagesChunk*)*2);
+ store->alloc = svtm_alloc_alloc();
+ return store;
+}
+
+static void*
+svtm_alloc(SVTm *store, Size size)
+{
+ SVTAlloc *alloc = store->alloc;
+ void *res;
+
+ size = INTALIGN(size);
+
+ if (alloc->limit - alloc->pos < size)
+ {
+ alloc = svtm_alloc_alloc();
+ alloc->next = store->alloc;
+ store->alloc = alloc;
+ }
+
+ res = alloc->bytes + alloc->pos;
+ alloc->pos += size;
+
+ return res;
+}
+
+void
+svtm_free(SVTm *store)
+{
+ SVTAlloc *alloc, *next;
+
+ if (store == NULL)
+ return;
+ if (store->ixmap != NULL)
+ pfree(store->ixmap);
+ if (store->chunks != NULL)
+ pfree(store->chunks);
+ alloc = store->alloc;
+ while (alloc != NULL)
+ {
+ next = alloc->next;
+ pfree(alloc);
+ alloc = next;
+ }
+ pfree(store);
+}
+
+void
+svtm_add_page(SVTm *store, const BlockNumber blkno,
+ const OffsetNumber *offnums, uint32 nitems)
+{
+ SVTChunkBuilder *bld = &store->builder;
+ SVTHeader header = 0;
+ uint32 chunkno = PAGE_TO_CHUNK(blkno);
+ uint32 bmlen = 0, bbmlen = 0, bbbmlen = 0;
+ uint32 sbmlen = 0;
+ uint32 nonzerocnt;
+ uint32 allzerocnt = 0, allonecnt = 0;
+ uint32 firstoff, lastoff;
+ uint32 i, j;
+ uint8 *append;
+ uint8 bitmap[BITMAP_PER_PAGE] = {0};
+ uint8 spix1[BITMAP_PER_PAGE/8+1] = {0};
+ uint8 spix2[BITMAP_PER_PAGE/64+2] = {0};
+#define off(i) (offnums[i]-1)
+
+ if (nitems == 0)
+ return;
+
+ if (chunkno != bld->chunk_number)
+ {
+ Assert(chunkno > bld->chunk_number);
+ svtm_build_chunk(store);
+ bld->chunk_number = chunkno;
+ }
+
+ Assert(bld->npages == 0 || blkno > bld->pages[bld->npages-1]);
+
+ firstoff = off(0);
+ lastoff = off(nitems-1);
+ Assert(lastoff < (1<<11));
+
+ if (nitems == 1 && lastoff < (1<<10))
+ {
+ /* 1 embedded item */
+ header = MakeHeaderType(SVTH_single);
+ header |= firstoff;
+ }
+ else
+ {
+ Assert(bld->bitmaps_pos < MaxBitmapPosition);
+
+ append = bld->bitmaps + bld->bitmaps_pos;
+ header = MakeBitmapPosition(bld->bitmaps_pos);
+ /* calculate bitmap */
+ for (i = 0; i < nitems; i++)
+ {
+ Assert(i == 0 || off(i) > off(i-1));
+ bitmap[makeoff(off(i),8)] |= makebit(off(i), 8);
+ }
+
+ bmlen = lastoff/8 + 1;
+ append[0] = bmlen;
+
+ for (i = 0; i < bmlen; i++)
+ {
+ allzerocnt += bitmap[i] == 0;
+ allonecnt += bitmap[i] == 0xff;
+ }
+
+ /* if we could not abuse sparness of bitmap, pack it as is */
+ if (allzerocnt <= bmlen*5/8 && allonecnt <= bmlen*5/8)
+ {
+ header |= MakeHeaderType(SVTH_rawBitmap);
+ memmove(append+1, bitmap, bmlen);
+ bld->bitmaps_pos += bmlen + 1;
+ }
+ else
+ {
+ /* if there is more present tuples than absent, invert map */
+ if (allonecnt > bmlen*5/8)
+ {
+ header |= MakeHeaderType(SVTH_inverseBitmap);
+ for (i = 0; i < bmlen; i++)
+ bitmap[i] ^= 0xff;
+ nonzerocnt = bmlen - allonecnt;
+ }
+ else
+ {
+ header |= MakeHeaderType(SVTH_sparseBitmap);
+ nonzerocnt = bmlen - allzerocnt;
+ }
+
+ /* Then we compose two level bitmap index for bitmap. */
+
+ /* First compress bitmap itself with first level index */
+ bbmlen = (bmlen+7)/8;
+ j = 0;
+ for (i = 0; i < bmlen; i++)
+ {
+ if (bitmap[i] != 0)
+ {
+ spix1[makeoff(i, 8)] |= makebit(i, 8);
+ bitmap[j] = bitmap[i];
+ j++;
+ }
+ }
+ Assert(j == nonzerocnt);
+
+ /* Then compress first level index with second level */
+ bbbmlen = (bbmlen+7)/8;
+ Assert(bbbmlen <= 3);
+ sbmlen = 0;
+ for (i = 0; i < bbmlen; i++)
+ {
+ if (spix1[i] != 0)
+ {
+ spix2[makeoff(i, 8)] |= makebit(i, 8);
+ spix1[sbmlen] = spix1[i];
+ sbmlen++;
+ }
+ }
+ Assert(sbmlen < 19);
+
+ /*
+ * second byte contains length of first level and offset
+ * to compressed bitmap itself.
+ */
+ append[1] = (bbbmlen << 5) | (bbbmlen + sbmlen);
+ memmove(append+2, spix2, bbbmlen);
+ memmove(append+2+bbbmlen, spix1, sbmlen);
+ memmove(append+2+bbbmlen+sbmlen, bitmap, nonzerocnt);
+ bld->bitmaps_pos += bbbmlen + sbmlen + nonzerocnt + 2;
+ }
+ Assert(bld->bitmaps_pos <= MaxBitmapPosition);
+ }
+ bld->pages[bld->npages] = blkno;
+ bld->headers[bld->npages] = header;
+ bld->npages++;
+ bld->hcnt[HeaderType(header)]++;
+}
+#undef off
+
+static void
+svtm_build_chunk(SVTm *store)
+{
+ SVTChunkBuilder *bld = &store->builder;
+ SVTPagesChunk *chunk;
+ uint32 bitmap = 0;
+ BlockNumber startblock;
+ uint32 off;
+ uint32 i;
+ Size total_size;
+
+ Assert(bld->npages < ~(uint16)0);
+
+ if (bld->npages == 0)
+ return;
+
+ startblock = CHUNK_TO_PAGE(bld->chunk_number);
+ for (i = 0; i < bld->npages; i++)
+ {
+ off = bld->pages[i] - startblock;
+ bitmap |= makebit(off, 32);
+ }
+
+ total_size = offsetof(SVTPagesChunk, headers) +
+ sizeof(SVTHeader)*bld->npages +
+ bld->bitmaps_pos;
+
+ chunk = svtm_alloc(store, total_size);
+ chunk->chunk_number = bld->chunk_number;
+ chunk->bitmap = bitmap;
+ memmove(chunk->headers,
+ bld->headers, sizeof(SVTHeader)*bld->npages);
+ memmove((char*)(chunk->headers + bld->npages),
+ bld->bitmaps, bld->bitmaps_pos);
+
+ /*
+ * We allocate store->chunks in power-of-two sizes.
+ * Then check for "we will overflow" is equal to "nchunks is power of two".
+ */
+ if ((store->nchunks & (store->nchunks-1)) == 0)
+ {
+ Size new_nchunks = store->nchunks ? (store->nchunks<<1) : 1;
+ store->chunks = (SVTPagesChunk**) repalloc(store->chunks,
+ new_nchunks * sizeof(SVTPagesChunk*));
+ }
+ store->chunks[store->nchunks] = chunk;
+ store->nchunks++;
+ store->lastblock = bld->pages[bld->npages-1];
+ store->total_size += total_size;
+
+ for (i = 0; i<4; i++)
+ store->hcnt[i] += bld->hcnt[i];
+ store->npages += bld->npages;
+
+ memset(bld, 0, sizeof(SVTChunkBuilder));
+}
+
+void
+svtm_finalize_addition(SVTm *store)
+{
+ SVTPagesChunk **chunks = store->chunks;
+ IxMap *ixmap;
+ uint32 last_chunk, chunkno;
+ uint32 firstrun, firstrunend;
+ uint32 nmaps;
+ uint32 i;
+
+ if (store->nchunks == 0)
+ {
+ /*
+ * block number will be rejected with:
+ * block <= lastblock, lastblock == 0
+ * chunk >= firstrun.start, firstrun.start = 1
+ */
+ store->firstrun.start = 1;
+ return;
+ }
+
+ firstrun = chunks[0]->chunk_number;
+ firstrunend = firstrun+1;
+
+ /* adsorb last chunk */
+ svtm_build_chunk(store);
+
+ /* Now we need to build ixmap */
+ last_chunk = PAGE_TO_CHUNK(store->lastblock);
+ nmaps = makeoff(last_chunk, 32) + 1;
+ ixmap = palloc0(nmaps * sizeof(IxMap));
+
+ for (i = 0; i < store->nchunks; i++)
+ {
+ chunkno = chunks[i]->chunk_number;
+ if (chunkno == firstrunend)
+ firstrunend++;
+ chunkno -= firstrun;
+ ixmap[makeoff(chunkno,32)].bitmap |= makebit(chunkno,32);
+ }
+
+ for (i = 1; i < nmaps; i++)
+ {
+ ixmap[i].offset = ixmap[i-1].offset;
+ ixmap[i].offset += svt_popcnt32(ixmap[i-1].bitmap);
+ }
+
+ store->firstrun.start = firstrun;
+ store->firstrun.end = firstrunend;
+ store->ixmap = ixmap;
+}
+
+bool
+svtm_lookup(SVTm *store, ItemPointer tid)
+{
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ OffsetNumber offset = ItemPointerGetOffsetNumber(tid) - 1;
+ SVTPagesChunk *chunk;
+ IxMap *ixmap = store->ixmap;
+ uint32 off, bit;
+
+ SVTHeader header;
+ uint8 *bitmaps;
+ uint8 *bitmap;
+ uint32 index;
+ uint32 chunkno, blk_in_chunk;
+ uint8 type;
+ uint8 bmoff, bmbit, bmlen, bmbyte;
+ uint8 bmstart, bbmoff, bbmbit, bbmbyte;
+ uint8 bbbmlen, bbbmoff, bbbmbit;
+ uint8 six1off, sbmoff;
+ bool inverse, bitset;
+
+ if (blkno > store->lastblock)
+ return false;
+
+ chunkno = PAGE_TO_CHUNK(blkno);
+ if (chunkno < store->firstrun.start)
+ return false;
+
+ if (chunkno < store->firstrun.end)
+ index = chunkno - store->firstrun.start;
+ else
+ {
+ off = makeoff(chunkno - store->firstrun.start, 32);
+ bit = makebit(chunkno - store->firstrun.start, 32);
+ if ((ixmap[off].bitmap & bit) == 0)
+ return false;
+
+ index = ixmap[off].offset + svt_popcnt32(ixmap[off].bitmap & (bit-1));
+ }
+ chunk = store->chunks[index];
+ Assert(chunkno == chunk->chunk_number);
+
+ blk_in_chunk = blkno - CHUNK_TO_PAGE(chunkno);
+ bit = makebit(blk_in_chunk, 32);
+
+ if ((chunk->bitmap & bit) == 0)
+ return false;
+ index = svt_popcnt32(chunk->bitmap & (bit - 1));
+ header = chunk->headers[index];
+
+ type = HeaderType(header);
+ if (type == SVTH_single)
+ return offset == SingleItem(header);
+
+ bitmaps = (uint8*)(chunk->headers + svt_popcnt32(chunk->bitmap));
+ bmoff = makeoff(offset, 8);
+ bmbit = makebit(offset, 8);
+ inverse = false;
+
+ bitmap = bitmaps + BitmapPosition(header);
+ bmlen = bitmap[0];
+ if (bmoff >= bmlen)
+ return false;
+
+ switch (type)
+ {
+ case SVTH_rawBitmap:
+ return (bitmap[bmoff+1] & bmbit) != 0;
+
+ case SVTH_inverseBitmap:
+ inverse = true;
+ /* fallthrough */
+ case SVTH_sparseBitmap:
+ bmstart = bitmap[1] & 0x1f;
+ bbbmlen = bitmap[1] >> 5;
+ bitmap += 2;
+ bbmoff = makeoff(bmoff, 8);
+ bbmbit = makebit(bmoff, 8);
+ bbbmoff = makeoff(bbmoff, 8);
+ bbbmbit = makebit(bbmoff, 8);
+ /* check bit in second level index */
+ if ((bitmap[bbbmoff] & bbbmbit) == 0)
+ return inverse;
+ /* calculate sparse offset into compressed first level index */
+ six1off = pg_popcount((char*)bitmap, bbbmoff) +
+ svt_popcnt8(bitmap[bbbmoff] & (bbbmbit-1));
+ /* check bit in first level index */
+ bbmbyte = bitmap[bbbmlen+six1off];
+ if ((bbmbyte & bbmbit) == 0)
+ return inverse;
+ /* and sparse offset into compressed bitmap itself */
+ sbmoff = pg_popcount((char*)bitmap+bbbmlen, six1off) +
+ svt_popcnt8(bbmbyte & (bbmbit-1));
+ bmbyte = bitmap[bmstart + sbmoff];
+ /* finally check bit in bitmap */
+ bitset = (bmbyte & bmbit) != 0;
+ return bitset != inverse;
+ }
+ Assert(false);
+ return false;
+}
+
+void svtm_stats(SVTm *store)
+{
+ StringInfo s;
+
+ s = makeStringInfo();
+ appendStringInfo(s, "svtm: nchunks %u npages %u\n",
+ store->nchunks, store->npages);
+ appendStringInfo(s, "single=%u raw=%u inserse=%u sparse=%u",
+ store->hcnt[SVTH_single], store->hcnt[SVTH_rawBitmap],
+ store->hcnt[SVTH_inverseBitmap], store->hcnt[SVTH_sparseBitmap]);
+
+ elog(NOTICE, "%s", s->data);
+ pfree(s->data);
+ pfree(s);
+}
diff --git a/bdbench/svtm.h b/bdbench/svtm.h
new file mode 100644
index 0000000..fdb5e3f
--- /dev/null
+++ b/bdbench/svtm.h
@@ -0,0 +1,19 @@
+#ifndef _SVTM_H
+#define _SVTM_H
+
+/* Specialized Vacuum TID Map */
+typedef struct SVTm SVTm;
+
+SVTm *svtm_create(void);
+void svtm_free(SVTm *store);
+/*
+ * Add page tuple offsets to map.
+ * offnums should be sorted. Max offset number should be < 2048.
+ */
+void svtm_add_page(SVTm *store, const BlockNumber blkno,
+ const OffsetNumber *offnums, uint32 nitems);
+void svtm_finalize_addition(SVTm *store);
+bool svtm_lookup(SVTm *store, ItemPointer tid);
+void svtm_stats(SVTm *store);
+
+#endif
--
2.32.0
On Mon, Jul 26, 2021 at 1:07 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
Hi,
I've dreamed of writing a more compact structure for vacuum for three
years, but life didn't give me the time.
Let me join the friendly competition.
I've bet on the HAMT approach: popcount-ing bitmaps for non-empty elements.
Thank you for proposing the new idea!
Novelties:
- 32 consecutive pages are stored together in a single sparse array
(called "chunks").
Chunk contains:
- its number,
- 4 byte bitmap of non-empty pages,
- array of non-empty page headers 2 byte each.
Page header contains offset of page's bitmap in bitmaps container.
(Except if there is just one dead tuple in a page. Then it is
written into header itself).
- container of concatenated bitmaps.
I.e., page metadata overhead varies from 2.4 bytes (32 pages in a single
chunk) to 18 bytes (1 page in a single chunk) per page.
- If a page's bitmap is sparse, i.e. contains a lot of "all-zero" bytes,
it is compressed by removing the zero bytes and indexing with a two-level
bitmap index.
Two-level index - zero bytes in first level are removed using
second level. It is mostly done for 32kb pages, but let it stay since
it is almost free.
- If a page's bitmap contains a lot of "all-one" bytes, it is inverted
and then encoded as sparse.
- Chunks are allocated with custom "allocator" that has no
per-allocation overhead. It is possible because there is no need
to perform "free": allocator is freed as whole at once.
- The array of pointers to chunks is also bitmap indexed. It saves CPU time
when not every run of 32 consecutive pages has at least one dead tuple,
but consumes time otherwise. Therefore an additional optimization is added
to quickly skip the lookup for the first non-empty run of chunks.
(Ahhh, I believe this explanation is awful).
It sounds better than my proposal.
Andres Freund wrote 2021-07-20 02:49:
[...]
Therefore the Specialized Vacuum Tid Map always consumes the least memory
and is usually faster.
I'll experiment with the proposed ideas including this idea in more
scenarios and share the results tomorrow.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, Jul 26, 2021 at 11:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I'll experiment with the proposed ideas including this idea in more
scenarios and share the results tomorrow.
I've done some benchmarks for the proposed data structures. In this trial,
I've added a scenario where dead tuples are concentrated on a
particular range of table blocks (tests 5-8), in addition to the
scenarios I've done in the previous trial. Also, I've done benchmarks
of each scenario while increasing table size. In the first test, the
maximum block number of the table is 1,000,000 (i.e., an 8GB table) and
in the second test, it's 10,000,000 (80GB table). We can see how
performance and memory consumption changes with a large-scale table.
Here are the results:
* Test 1
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
20 -- page interval
);
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 57.23 MB | 0.040 | 98.613 | 572.21 MB | 0.387 | 1521.981
intset | 46.88 MB | 0.114 | 75.944 | 468.67 MB | 0.961 | 997.760
radix | 40.26 MB | 0.102 | 18.427 | 336.64 MB | 0.797 | 266.146
rtbm | 64.02 MB | 0.234 | 22.443 | 512.02 MB | 2.230 | 275.143
svtm | 27.28 MB | 0.060 | 13.568 | 274.07 MB | 0.476 | 211.073
tbm | 96.01 MB | 0.273 | 10.347 | 768.01 MB | 2.882 | 128.103
* Test 2
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 57.23 MB | 0.041 | 4.757 | 572.21 MB | 0.344 | 71.228
intset | 46.88 MB | 0.127 | 3.762 | 468.67 MB | 1.093 | 49.573
radix | 9.95 MB | 0.048 | 0.679 | 82.57 MB | 0.371 | 16.211
rtbm | 34.02 MB | 0.179 | 0.534 | 288.02 MB | 2.092 | 8.693
svtm | 5.78 MB | 0.043 | 0.239 | 54.60 MB | 0.342 | 7.759
tbm | 96.01 MB | 0.274 | 0.521 | 768.01 MB | 2.685 | 6.360
* Test 3
select prepare(
1000000, -- max block
2, -- # of dead tuples per page
100, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 11.45 MB | 0.009 | 57.698 | 114.45 MB | 0.076 | 1045.639
intset | 15.63 MB | 0.031 | 46.083 | 156.23 MB | 0.243 | 848.525
radix | 40.26 MB | 0.063 | 13.755 | 336.64 MB | 0.501 | 223.413
rtbm | 36.02 MB | 0.123 | 11.527 | 320.02 MB | 1.843 | 180.977
svtm | 9.28 MB | 0.053 | 9.631 | 92.59 MB | 0.438 | 212.626
tbm | 96.01 MB | 0.228 | 10.381 | 768.01 MB | 2.258 | 126.630
* Test 4
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 572.21 MB | 0.367 | 78.047 | 5722.05 MB | 3.942 | 1154.776
intset | 93.74 MB | 0.777 | 45.146 | 937.34 MB | 7.716 | 643.708
radix | 40.26 MB | 0.203 | 9.015 | 336.64 MB | 1.775 | 133.294
rtbm | 36.02 MB | 0.369 | 5.639 | 320.02 MB | 3.823 | 88.832
svtm | 7.28 MB | 0.294 | 3.891 | 73.60 MB | 2.690 | 103.744
tbm | 96.01 MB | 0.534 | 5.223 | 768.01 MB | 5.679 | 60.632
* Test 5
select prepare(
1000000, -- max block
150, -- # of dead tuples per page
1, -- dead tuples interval within a page
10000, -- # of consecutive pages having dead tuples
20000 -- page interval
);
There are 10000 consecutive pages that have 150 dead tuples at every
20000 pages.
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 429.16 MB | 0.274 | 75.664 | 4291.54 MB | 3.067 | 1259.501
intset | 46.88 MB | 0.559 | 36.449 | 468.67 MB | 4.565 | 517.445
radix | 20.26 MB | 0.166 | 8.466 | 196.90 MB | 1.273 | 166.587
rtbm | 18.02 MB | 0.242 | 8.491 | 160.02 MB | 2.407 | 171.725
svtm | 3.66 MB | 0.243 | 3.635 | 37.10 MB | 2.022 | 86.165
tbm | 48.01 MB | 0.344 | 9.763 | 384.01 MB | 3.327 | 151.824
* Test 6
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
10000, -- # of consecutive pages having dead tuples
20000 -- page interval
);
There are 10000 consecutive pages that have 10 dead tuples at every 20000 pages.
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 28.62 MB | 0.022 | 2.791 | 286.11 MB | 0.170 | 46.920
intset | 23.45 MB | 0.061 | 2.156 | 234.34 MB | 0.501 | 32.577
radix | 5.04 MB | 0.026 | 0.433 | 48.57 MB | 0.191 | 11.060
rtbm | 17.02 MB | 0.074 | 0.533 | 144.02 MB | 0.954 | 11.502
svtm | 3.16 MB | 0.023 | 0.206 | 27.60 MB | 0.175 | 4.886
tbm | 48.01 MB | 0.132 | 0.656 | 384.01 MB | 1.284 | 10.231
* Test 7
select prepare(
1000000, -- max block
150, -- # of dead tuples per page
1, -- dead tuples interval within a page
1000, -- # of consecutive pages having dead tuples
999000 -- page interval
);
There are pages that have 150 dead tuples at first 1000 blocks and
last 1000 blocks.
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 1.72 MB | 0.002 | 7.507 | 17.17 MB | 0.011 | 76.510
intset | 0.20 MB | 0.003 | 6.742 | 1.89 MB | 0.022 | 52.122
radix | 0.20 MB | 0.001 | 1.023 | 1.07 MB | 0.007 | 12.023
rtbm | 0.15 MB | 0.001 | 2.637 | 0.65 MB | 0.009 | 34.528
svtm | 0.52 MB | 0.002 | 0.721 | 0.61 MB | 0.010 | 6.434
tbm | 0.20 MB | 0.002 | 2.733 | 1.51 MB | 0.015 | 38.538
* Test 8
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within a page
50, -- # of consecutive pages having dead tuples
100 -- page interval
);
There are 50 consecutive pages that have 100 dead tuples at every 100 pages.
name | size | attach | shuffled | size_x10 | attach_x10 | shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 286.11 MB | 0.184 | 67.233 | 2861.03 MB | 1.743 | 979.070
intset | 46.88 MB | 0.389 | 35.176 | 468.67 MB | 3.698 | 505.322
radix | 21.82 MB | 0.116 | 6.160 | 186.86 MB | 0.891 | 117.730
rtbm | 18.02 MB | 0.182 | 5.909 | 160.02 MB | 1.870 | 112.550
svtm | 4.28 MB | 0.152 | 3.213 | 37.60 MB | 1.383 | 79.073
tbm | 48.01 MB | 0.265 | 6.673 | 384.01 MB | 2.586 | 101.327
Overall, 'svtm' is faster and consumes less memory. 'radix' tree also
has good performance and memory usage.
From these results, svtm is the best data structure among proposed
ideas for dead tuple storage used during lazy vacuum in terms of
performance and memory usage. I think it can support iteration by
extracting the offset of dead tuples for each block while iterating
chunks.
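(For illustration, a minimal sketch of what such block-at-a-time iteration
could look like; the type and field names are hypothetical stand-ins rather
than the SVTm definitions from the patch:)

#include "postgres.h"
#include "port/pg_bitutils.h"
#include "storage/block.h"

/* Hypothetical, simplified view of one chunk: which of its 32 blocks are non-empty. */
typedef struct ChunkIterSketch
{
	uint32		chunk_number;	/* chunk covers blocks [chunk_number * 32, chunk_number * 32 + 32) */
	uint32		present;		/* bit i set => block (chunk_number * 32 + i) has dead tuples */
} ChunkIterSketch;

/*
 * Visit every non-empty block of one chunk in ascending order.  A real
 * iterator would additionally decode the page's (possibly sparse) bitmap
 * into an array of OffsetNumbers at each step.
 */
static void
chunk_iterate_blocks(const ChunkIterSketch *chunk,
					 void (*callback) (BlockNumber blkno, void *arg),
					 void *arg)
{
	uint32		remaining = chunk->present;

	while (remaining != 0)
	{
		int			i = pg_rightmost_one_pos32(remaining);

		callback((BlockNumber) (chunk->chunk_number * 32 + i), arg);
		remaining &= remaining - 1;		/* clear the bit just visited */
	}
}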
Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.
In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases? On the other hand, I'm concerned that a radix tree would be
over-engineering in terms of vacuum's dead tuple storage since the
dead tuple storage is static data and requires only lookup operations,
so if we want to use a radix tree as dead tuple storage, I'd like to see
further use cases.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Masahiko Sawada wrote 2021-07-27 07:06:
On Mon, Jul 26, 2021 at 11:01 PM Masahiko Sawada
<sawada.mshk@gmail.com> wrote:I'll experiment with the proposed ideas including this idea in more
scenarios and share the results tomorrow.I've done some benchmarks for proposed data structures. In this trial,
I've done with the scenario where dead tuples are concentrated on a
particular range of table blocks (test 5-8), in addition to the
scenarios I've done in the previous trial. Also, I've done benchmarks
of each scenario while increasing table size. In the first test, the
maximum block number of the table is 1,000,000 (i.g., 8GB table) and
in the second test, it's 10,000,000 (80GB table). We can see how
performance and memory consumption changes with a large-scale table.
Here are the results:* Test 1
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
20 -- page interval
);name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 57.23 MB | 0.040 | 98.613 | 572.21 MB | 0.387 |
1521.981
intset | 46.88 MB | 0.114 | 75.944 | 468.67 MB | 0.961 |
997.760
radix | 40.26 MB | 0.102 | 18.427 | 336.64 MB | 0.797 |
266.146
rtbm | 64.02 MB | 0.234 | 22.443 | 512.02 MB | 2.230 |
275.143
svtm | 27.28 MB | 0.060 | 13.568 | 274.07 MB | 0.476 |
211.073
tbm | 96.01 MB | 0.273 | 10.347 | 768.01 MB | 2.882 |
128.103* Test 2
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 57.23 MB | 0.041 | 4.757 | 572.21 MB | 0.344 |
71.228
intset | 46.88 MB | 0.127 | 3.762 | 468.67 MB | 1.093 |
49.573
radix | 9.95 MB | 0.048 | 0.679 | 82.57 MB | 0.371 |
16.211
rtbm | 34.02 MB | 0.179 | 0.534 | 288.02 MB | 2.092 |
8.693
svtm | 5.78 MB | 0.043 | 0.239 | 54.60 MB | 0.342 |
7.759
tbm | 96.01 MB | 0.274 | 0.521 | 768.01 MB | 2.685 |
6.360* Test 3
select prepare(
1000000, -- max block
2, -- # of dead tuples per page
100, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 11.45 MB | 0.009 | 57.698 | 114.45 MB | 0.076 |
1045.639
intset | 15.63 MB | 0.031 | 46.083 | 156.23 MB | 0.243 |
848.525
radix | 40.26 MB | 0.063 | 13.755 | 336.64 MB | 0.501 |
223.413
rtbm | 36.02 MB | 0.123 | 11.527 | 320.02 MB | 1.843 |
180.977
svtm | 9.28 MB | 0.053 | 9.631 | 92.59 MB | 0.438 |
212.626
tbm | 96.01 MB | 0.228 | 10.381 | 768.01 MB | 2.258 |
126.630* Test 4
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1 -- page interval
);name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 572.21 MB | 0.367 | 78.047 | 5722.05 MB | 3.942 |
1154.776
intset | 93.74 MB | 0.777 | 45.146 | 937.34 MB | 7.716 |
643.708
radix | 40.26 MB | 0.203 | 9.015 | 336.64 MB | 1.775 |
133.294
rtbm | 36.02 MB | 0.369 | 5.639 | 320.02 MB | 3.823 |
88.832
svtm | 7.28 MB | 0.294 | 3.891 | 73.60 MB | 2.690 |
103.744
tbm | 96.01 MB | 0.534 | 5.223 | 768.01 MB | 5.679 |
60.632* Test 5
select prepare(
1000000, -- max block
150, -- # of dead tuples per page
1, -- dead tuples interval within a page
10000, -- # of consecutive pages having dead tuples
20000 -- page interval
);There are 10000 consecutive pages that have 150 dead tuples at every
20000 pages.name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 429.16 MB | 0.274 | 75.664 | 4291.54 MB | 3.067 |
1259.501
intset | 46.88 MB | 0.559 | 36.449 | 468.67 MB | 4.565 |
517.445
radix | 20.26 MB | 0.166 | 8.466 | 196.90 MB | 1.273 |
166.587
rtbm | 18.02 MB | 0.242 | 8.491 | 160.02 MB | 2.407 |
171.725
svtm | 3.66 MB | 0.243 | 3.635 | 37.10 MB | 2.022 |
86.165
tbm | 48.01 MB | 0.344 | 9.763 | 384.01 MB | 3.327 |
151.824* Test 6
select prepare(
1000000, -- max block
10, -- # of dead tuples per page
1, -- dead tuples interval within a page
10000, -- # of consecutive pages having dead tuples
20000 -- page interval
);There are 10000 consecutive pages that have 10 dead tuples at every
20000 pages.name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 28.62 MB | 0.022 | 2.791 | 286.11 MB | 0.170 |
46.920
intset | 23.45 MB | 0.061 | 2.156 | 234.34 MB | 0.501 |
32.577
radix | 5.04 MB | 0.026 | 0.433 | 48.57 MB | 0.191 |
11.060
rtbm | 17.02 MB | 0.074 | 0.533 | 144.02 MB | 0.954 |
11.502
svtm | 3.16 MB | 0.023 | 0.206 | 27.60 MB | 0.175 |
4.886
tbm | 48.01 MB | 0.132 | 0.656 | 384.01 MB | 1.284 |
10.231* Test 7
select prepare(
1000000, -- max block
150, -- # of dead tuples per page
1, -- dead tuples interval within a page
1000, -- # of consecutive pages having dead tuples
999000 -- page interval
);There are pages that have 150 dead tuples at first 1000 blocks and
last 1000 blocks.name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
array | 1.72 MB | 0.002 | 7.507 | 17.17 MB | 0.011 |
76.510
intset | 0.20 MB | 0.003 | 6.742 | 1.89 MB | 0.022 |
52.122
radix | 0.20 MB | 0.001 | 1.023 | 1.07 MB | 0.007 |
12.023
rtbm | 0.15 MB | 0.001 | 2.637 | 0.65 MB | 0.009 |
34.528
svtm | 0.52 MB | 0.002 | 0.721 | 0.61 MB | 0.010 |
6.434
tbm | 0.20 MB | 0.002 | 2.733 | 1.51 MB | 0.015 |
38.538* Test 8
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within a page
50, -- # of consecutive pages having dead tuples
100 -- page interval
);There are 50 consecutive pages that have 100 dead tuples at every 100
pages.name | attach | attach | shuffled | size_x10 | attach_x10|
shuffled_x10
--------+-----------+--------+----------+------------+-----------+-------------
 array  | 286.11 MB | 0.184  | 67.233   | 2861.03 MB | 1.743     | 979.070
 intset | 46.88 MB  | 0.389  | 35.176   | 468.67 MB  | 3.698     | 505.322
 radix  | 21.82 MB  | 0.116  | 6.160    | 186.86 MB  | 0.891     | 117.730
 rtbm   | 18.02 MB  | 0.182  | 5.909    | 160.02 MB  | 1.870     | 112.550
 svtm   | 4.28 MB   | 0.152  | 3.213    | 37.60 MB   | 1.383     | 79.073
 tbm    | 48.01 MB  | 0.265  | 6.673    | 384.01 MB  | 2.586     | 101.327

Overall, 'svtm' is faster and consumes less memory. 'radix' tree also
has good performance and memory usage.
From these results, svtm is the best data structure among proposed
ideas for dead tuple storage used during lazy vacuum in terms of
performance and memory usage. I think it can support iteration by
extracting the offset of dead tuples for each block while iterating
chunks.

Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.

In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases? On the other hand, I’m concerned that radix tree would be an
over-engineering in terms of vacuum's dead tuples storage since the
dead tuple storage is static data and requires only lookup operation,
so if we want to use radix tree as dead tuple storage, I'd like to see
further use cases.
I can certainly evolve svtm into a transparent intset replacement. Using
the same trick as radix_to_key, it will store TIDs efficiently:

shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
tid_i = ItemPointerGetOffsetNumber(tid);
tid_i |= ItemPointerGetBlockNumber(tid) << shift;

I will do it this evening.
regards
Yura Sokolov aka funny_falcon
Hi,
On 2021-07-25 19:07:18 +0300, Yura Sokolov wrote:
I've dreamed of writing a more compact structure for vacuum for three
years, but life didn't give me the time to.

Let me join the friendly competition.

I've bet on the HATM approach: popcount-ing bitmaps for non-empty elements.
My concern with several of the proposals in this thread is that they
over-optimize for this specific case. It's not actually that crucial to
have a crazily optimized vacuum dead tid storage datatype. Having
something more general that also performs reasonably for the dead tuple
storage, but also performs well in a number of other cases, makes a lot
more sense to me.
(Bad radix result probably due to smaller cache in notebook's CPU ?)
Probably largely due to the node dispatch. a) For some reason gcc likes
jump tables too much, I get better numbers when disabling those b) the
node type dispatch should be stuffed into the low bits of the pointer.
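For readers unfamiliar with that trick, here is a minimal, hypothetical
sketch (none of these names come from the posted patches): because radix
tree nodes are allocated with at least 8-byte alignment, the low bits of a
child pointer are always zero in a real address and can carry the node
type, so dispatching on the node kind needs no extra memory access.

/* Hypothetical sketch of tagging the node type into low pointer bits. */
typedef enum { NODE_4, NODE_16, NODE_48, NODE_256 } NodeKind;

#define NODE_KIND_MASK	((uintptr_t) 0x7)

static inline void *
tagged_make(void *node, NodeKind kind)
{
	/* nodes are assumed to be at least 8-byte aligned */
	Assert(((uintptr_t) node & NODE_KIND_MASK) == 0);
	return (void *) ((uintptr_t) node | (uintptr_t) kind);
}

static inline NodeKind
tagged_kind(void *tagged)
{
	return (NodeKind) ((uintptr_t) tagged & NODE_KIND_MASK);
}

static inline void *
tagged_node(void *tagged)
{
	return (void *) ((uintptr_t) tagged & ~NODE_KIND_MASK);
}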
select prepare(1000000, 2, 100, 1);
attach size shuffled
array 6ms 12MB 53.42s
intset 23ms 16MB 54.99s
rtbm 115ms 38MB 8.19s
tbm 186ms 100MB 8.37s
vtbm 105ms 59MB 9.08s
radix 64ms 42MB 10.41s
svtm 73ms 10MB 7.49s
Test4
select prepare(1000000, 100, 1, 1);
attach size shuffled
array 304ms 600MB 75.12s
intset 775ms 98MB 47.49s
rtbm 356ms 38MB 4.11s
tbm 539ms 100MB 4.20s
vtbm 493ms 42MB 4.44s
radix 263ms 42MB 6.05s
svtm 360ms 8MB 3.49s

Therefore the Specialized Vacuum Tid Map always consumes the least
memory and is usually faster.
Impressive.
Greetings,
Andres Freund
Hi,
On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.
Indeed.
In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases?
Yes, I think there are. Whenever there is some spatial locality it has a
decent chance of winning over a hash table, and it will most of the time
win over ordered datastructures like rbtrees (which perform very poorly
due to the number of branches and pointer dispatches). There's plenty
hashtables, e.g. for caches, locks, etc, in PG that have a medium-high
degree of locality, so I'd expect a few potential uses. When adding
"tree compression" (i.e. skip inner nodes that have a single incoming &
outgoing node) radix trees even can deal quite performantly with
variable width keys.
On the other hand, I’m concerned that radix tree would be an
over-engineering in terms of vacuum's dead tuples storage since the
dead tuple storage is static data and requires only lookup operation,
so if we want to use radix tree as dead tuple storage, I'd like to see
further use cases.
I don't think we should rely on the read-only-ness. It seems pretty
clear that we'd want parallel dead-tuple scans at a point not too far
into the future?
Greetings,
Andres Freund
On Thu, Jul 29, 2021 at 3:53 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.

Indeed.

In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases?

Yes, I think there are. Whenever there is some spatial locality it has a
decent chance of winning over a hash table, and it will most of the time
win over ordered datastructures like rbtrees (which perform very poorly
due to the number of branches and pointer dispatches). There's plenty
hashtables, e.g. for caches, locks, etc, in PG that have a medium-high
degree of locality, so I'd expect a few potential uses. When adding
"tree compression" (i.e. skip inner nodes that have a single incoming &
outgoing node) radix trees even can deal quite performantly with
variable width keys.
Good point.
On the other hand, I’m concerned that radix tree would be an
over-engineering in terms of vacuum's dead tuples storage since the
dead tuple storage is static data and requires only lookup operation,
so if we want to use radix tree as dead tuple storage, I'd like to see
further use cases.

I don't think we should rely on the read-only-ness. It seems pretty
clear that we'd want parallel dead-tuple scans at a point not too far
into the future?
Indeed. Given that the radix tree itself has other use cases, I have
no concern about using radix tree for vacuum's dead tuples storage. It
will be better to have one that can be generally used and has some
optimizations that are helpful also for vacuum's use case, rather than
having one that is very optimized only for vacuum's use case.
During the performance benchmark, I found some bugs in the radix tree
implementation. Also, we need the functionality of tree iteration, and
if we have the radix tree in the source tree as a general library, we
need some changes since the current implementation seems to be for a
replacement for shared buffer’s hash table. I'll try to work on that
stuff as a PoC if you don't. What do you think?
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Masahiko Sawada wrote 2021-07-29 12:11:
On Thu, Jul 29, 2021 at 3:53 AM Andres Freund <andres@anarazel.de>
wrote:

Hi,
On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.

Indeed.

In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases?

Yes, I think there are. Whenever there is some spatial locality it has
a
decent chance of winning over a hash table, and it will most of the
time
win over ordered datastructures like rbtrees (which perform very
poorly
due to the number of branches and pointer dispatches). There's plenty
hashtables, e.g. for caches, locks, etc, in PG that have a medium-high
degree of locality, so I'd expect a few potential uses. When adding
"tree compression" (i.e. skip inner nodes that have a single incoming
&
outgoing node) radix trees even can deal quite performantly with
variable width keys.

Good point.

On the other hand, I’m concerned that radix tree would be an
over-engineering in terms of vacuum's dead tuples storage since the
dead tuple storage is static data and requires only lookup operation,
so if we want to use radix tree as dead tuple storage, I'd like to see
further use cases.

I don't think we should rely on the read-only-ness. It seems pretty
clear that we'd want parallel dead-tuple scans at a point not too far
into the future?

Indeed. Given that the radix tree itself has other use cases, I have
no concern about using radix tree for vacuum's dead tuples storage. It
will be better to have one that can be generally used and has some
optimizations that are helpful also for vacuum's use case, rather than
having one that is very optimized only for vacuum's use case.
Main portion of svtm that leads to memory saving is compression of many
pages at once (CHUNK). It could be combined with radix as a storage for
pointers to CHUNKs.
For a moment I'm benchmarking IntegerSet replacement based on Trie (HATM
like)
and CHUNK compression, therefore datastructure could be used for gist
vacuum as well.
Since it is generic (allows to index all 64bit) it lacks of trick used
to speedup svtm. Still on 10x test it is faster than radix.
I'll send result later today after all benchmarks complete.
And I'll try then to make mix of radix and CHUNK compression.
During the performance benchmark, I found some bugs in the radix tree
implementation.
There is a bug in radix_to_key_off as well:
tid_i |= ItemPointerGetBlockNumber(tid) << shift;
ItemPointerGetBlockNumber returns uint32, therefore the result after the
shift is uint32 as well.
It leads to lower memory consumption (and therefore better times) on the
10x test when the page number exceeds 2^23 (8M). It still produces a
"correct" result for the test since every page is filled in the same way.
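To illustrate the truncation, a hedged sketch (the fixed form mirrors what
intset2_encode in the attached patch does; the exact radix_to_key_off code
is not reproduced here):

uint32		shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);	/* 9 for 8kB pages */
uint64		tid_i;

/* Buggy: ItemPointerGetBlockNumber() returns uint32, so the shift happens
 * in 32 bits and block numbers >= 2^23 silently wrap before the value is
 * widened to uint64. */
tid_i = ItemPointerGetOffsetNumber(tid);
tid_i |= ItemPointerGetBlockNumber(tid) << shift;

/* Fixed: widen to uint64 before shifting. */
tid_i = ItemPointerGetOffsetNumber(tid);
tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;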
Could you push your fixes for radix, please?
regards,
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
On Thu, Jul 29, 2021 at 8:03 PM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
Masahiko Sawada wrote 2021-07-29 12:11:
On Thu, Jul 29, 2021 at 3:53 AM Andres Freund <andres@anarazel.de>
wrote:

Hi,
On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.

Indeed.

In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases?

Yes, I think there are. Whenever there is some spatial locality it has
a
decent chance of winning over a hash table, and it will most of the
time
win over ordered datastructures like rbtrees (which perform very
poorly
due to the number of branches and pointer dispatches). There's plenty
hashtables, e.g. for caches, locks, etc, in PG that have a medium-high
degree of locality, so I'd expect a few potential uses. When adding
"tree compression" (i.e. skip inner nodes that have a single incoming
&
outgoing node) radix trees even can deal quite performantly with
variable width keys.

Good point.

On the other hand, I’m concerned that radix tree would be an
over-engineering in terms of vacuum's dead tuples storage since the
dead tuple storage is static data and requires only lookup operation,
so if we want to use radix tree as dead tuple storage, I'd like to see
further use cases.

I don't think we should rely on the read-only-ness. It seems pretty
clear that we'd want parallel dead-tuple scans at a point not too far
into the future?

Indeed. Given that the radix tree itself has other use cases, I have
no concern about using radix tree for vacuum's dead tuples storage. It
will be better to have one that can be generally used and has some
optimizations that are helpful also for vacuum's use case, rather than
having one that is very optimized only for vacuum's use case.

Main portion of svtm that leads to memory saving is compression of many
pages at once (CHUNK). It could be combined with radix as a storage for
pointers to CHUNKs.

For a moment I'm benchmarking IntegerSet replacement based on Trie (HATM
like) and CHUNK compression, therefore datastructure could be used for
gist vacuum as well.

Since it is generic (allows to index all 64bit) it lacks of trick used
to speedup svtm. Still on 10x test it is faster than radix.
BTW, how does svtm work when we add two sets of dead tuple TIDs to one
svtm? Dead tuple TIDs are unique sets but those sets could have TIDs
of the different offsets on the same block. The case I imagine is the
idea discussed on this thread[1]. With this idea, we store the
collected dead tuple TIDs somewhere and skip index vacuuming for some
reason (index skipping optimization, failsafe mode, or interruptions
etc.). Then, in the next lazy vacuum timing, we load the dead tuple
TIDs and start to scan the heap. During the heap scan in the second
lazy vacuum, it's possible that new dead tuples will be found on the
pages that we have already stored in svtm during the first lazy
vacuum. How can we efficiently update the chunk in the svtm?
Regards,
[1]: /messages/by-id/CA+TgmoZgapzekbTqdBrcH8O8Yifi10_nB7uWLB8ajAhGL21M6A@mail.gmail.com
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Masahiko Sawada wrote 2021-07-29 17:29:
On Thu, Jul 29, 2021 at 8:03 PM Yura Sokolov <y.sokolov@postgrespro.ru>
wrote:

Masahiko Sawada wrote 2021-07-29 12:11:
On Thu, Jul 29, 2021 at 3:53 AM Andres Freund <andres@anarazel.de>
wrote:

Hi,
On 2021-07-27 13:06:56 +0900, Masahiko Sawada wrote:
Apart from performance and memory usage points of view, we also need
to consider the reusability of the code. When I started this thread, I
thought the best data structure would be the one optimized for
vacuum's dead tuple storage. However, if we can use a data structure
that can also be used in general, we can use it also for other
purposes. Moreover, if it's too optimized for the current TID system
(32 bits block number, 16 bits offset number, maximum block/offset
number, etc.) it may become a blocker for future changes.

Indeed.

In that sense, radix tree also seems good since it can also be used in
gist vacuum as a replacement for intset, or a replacement for hash
table for shared buffer as discussed before. Are there any other use
cases?

Yes, I think there are. Whenever there is some spatial locality it has
a
decent chance of winning over a hash table, and it will most of the
time
win over ordered datastructures like rbtrees (which perform very
poorly
due to the number of branches and pointer dispatches). There's plenty
hashtables, e.g. for caches, locks, etc, in PG that have a medium-high
degree of locality, so I'd expect a few potential uses. When adding
"tree compression" (i.e. skip inner nodes that have a single incoming
&
outgoing node) radix trees even can deal quite performantly with
variable width keys.

Good point.

On the other hand, I’m concerned that radix tree would be an
over-engineering in terms of vacuum's dead tuples storage since the
dead tuple storage is static data and requires only lookup operation,
so if we want to use radix tree as dead tuple storage, I'd like to see
further use cases.

I don't think we should rely on the read-only-ness. It seems pretty
clear that we'd want parallel dead-tuple scans at a point not too far
into the future?

Indeed. Given that the radix tree itself has other use cases, I have
no concern about using radix tree for vacuum's dead tuples storage. It
will be better to have one that can be generally used and has some
optimizations that are helpful also for vacuum's use case, rather than
having one that is very optimized only for vacuum's use case.

Main portion of svtm that leads to memory saving is compression of many
pages at once (CHUNK). It could be combined with radix as a storage for
pointers to CHUNKs.

For a moment I'm benchmarking IntegerSet replacement based on Trie (HATM
like) and CHUNK compression, therefore datastructure could be used for
gist vacuum as well.

Since it is generic (allows to index all 64bit) it lacks of trick used
to speedup svtm. Still on 10x test it is faster than radix.
I've attached the IntegerSet2 patch for the pgtools repo and benchmark results.
Branch: https://github.com/funny-falcon/pgtools/tree/integerset2

SVTM is measured with a couple of changes from commit 5055ef72d23482dd3e11ce
in that branch: 1) compress the bitmap more often, but slower, 2) a couple of
popcount tricks.

IntegerSet2 consists of a trie index to CHUNKs. A CHUNK is a compressed bitmap
of 2^15 (6+9) bits (almost like in SVTM, but for a fixed bit width).

Well, IntegerSet2 is always faster than IntegerSet and always uses
significantly less memory (radix uses more memory than IntegerSet in a
couple of tests and comparable memory in others).

IntegerSet2 is not always faster than radix. It is more like radix.
That is because both are generic prefix trees with a comparable number of
memory accesses. SVTM did the trick by being not a multilevel prefix tree,
but just a 1-level bitmap index to chunks.

I believe the trie part of IntegerSet2 could be replaced with radix,
i.e. use radix as the storage for pointers to CHUNKs.
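To make that layout concrete, here is a small illustrative decomposition of
a value, following the macros in the attached integerset2.c (assuming 64-bit
bitmapwords, i.e. SHIFT = 6, LEAF_SHIFT = 9, CHUNK_SHIFT = 15); the function
name is made up for the example:

static void
intset2_explain_value(uint64 v)		/* v is relative to intset->firstvalue */
{
	uint64		trie_key = v >> 15;			/* which 2^15-value CHUNK (trie key) */
	uint32		leaf = (v >> 9) & 63;		/* CPOS: 512-bit leaf within the chunk */
	uint32		lbyte = (v >> 3) & 63;		/* LBYTE: byte within the leaf bitmap */
	uint32		lbit = 1 << (v & 7);		/* LBIT: bit within that byte */

	elog(DEBUG1, "chunk=" UINT64_FORMAT " leaf=%u byte=%u mask=0x%x",
		 trie_key, leaf, lbyte, lbit);
}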
BTW, how does svtm work when we add two sets of dead tuple TIDs to one
svtm? Dead tuple TIDs are unique sets but those sets could have TIDs
of the different offsets on the same block. The case I imagine is the
idea discussed on this thread[1]. With this idea, we store the
collected dead tuple TIDs somewhere and skip index vacuuming for some
reason (index skipping optimization, failsafe mode, or interruptions
etc.). Then, in the next lazy vacuum timing, we load the dead tuple
TIDs and start to scan the heap. During the heap scan in the second
lazy vacuum, it's possible that new dead tuples will be found on the
pages that we have already stored in svtm during the first lazy
vacuum. How can we efficiently update the chunk in the svtm?
If we store the tidmap to disk, then it will be serialized. Since SVTM/
IntegerSet2 are ordered, they could be loaded in order. Then we can just
merge tuples on a per-page basis: deserialize the page (or CHUNK), put the
new tuples in, and store it again. Since both scans (the scan of the
serialized map and the scan of the table) are in order, merging will be
cheap enough, as sketched below.
SVTM and IntegerSet2 already work in a "buffered" way on insertion.
(As well as IntegerSet, which also does compression but in small parts.)
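As a rough sketch of that merge (a hypothetical helper, reusing
intset2_add_member/intset2_encode from the attached benchmark code; not part
of any posted patch): because both inputs are sorted by TID, one pass
suffices and each CHUNK is rebuilt at most once.

static void
merge_dead_tids(IntegerSet2 *dst,
				ItemPointer old_tids, int nold,	/* from the serialized map */
				ItemPointer new_tids, int nnew)	/* from the current heap scan */
{
	int			i = 0,
				j = 0;

	while (i < nold || j < nnew)
	{
		int			cmp;

		if (i >= nold)
			cmp = 1;
		else if (j >= nnew)
			cmp = -1;
		else
			cmp = ItemPointerCompare(&old_tids[i], &new_tids[j]);

		if (cmp < 0)
			intset2_add_member(dst, intset2_encode(&old_tids[i++]));
		else if (cmp > 0)
			intset2_add_member(dst, intset2_encode(&new_tids[j++]));
		else
		{
			/* duplicate TID; add once */
			intset2_add_member(dst, intset2_encode(&old_tids[i++]));
			j++;
		}
	}
}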
regards,
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
Attachments:
0001-integerset2.patch
From c555983109cf202a2bd395de77711f302b7a5024 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <funny.falcon@gmail.com>
Date: Wed, 28 Jul 2021 17:21:02 +0300
Subject: [PATCH] integerset2
---
bdbench/Makefile | 2 +-
bdbench/bdbench--1.0.sql | 5 +
bdbench/bdbench.c | 86 +++-
bdbench/bench.sql | 2 +
bdbench/integerset2.c | 887 +++++++++++++++++++++++++++++++++++++++
bdbench/integerset2.h | 15 +
6 files changed, 994 insertions(+), 3 deletions(-)
create mode 100644 bdbench/integerset2.c
create mode 100644 bdbench/integerset2.h
diff --git a/bdbench/Makefile b/bdbench/Makefile
index a6f758f..0b00211 100644
--- a/bdbench/Makefile
+++ b/bdbench/Makefile
@@ -2,7 +2,7 @@
MODULE_big = bdbench
DATA = bdbench--1.0.sql
-OBJS = bdbench.o vtbm.o rtbm.o radix.o svtm.o
+OBJS = bdbench.o vtbm.o rtbm.o radix.o svtm.o integerset2.o
EXTENSION = bdbench
REGRESS= bdbench
diff --git a/bdbench/bdbench--1.0.sql b/bdbench/bdbench--1.0.sql
index 0ba10a8..ae15514 100644
--- a/bdbench/bdbench--1.0.sql
+++ b/bdbench/bdbench--1.0.sql
@@ -115,3 +115,8 @@ CREATE FUNCTION radix_run_tests()
RETURNS void
AS 'MODULE_PATHNAME'
LANGUAGE C STRICT VOLATILE;
+
+CREATE FUNCTION intset2_run_tests()
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE;
diff --git a/bdbench/bdbench.c b/bdbench/bdbench.c
index d15526e..883099c 100644
--- a/bdbench/bdbench.c
+++ b/bdbench/bdbench.c
@@ -22,6 +22,7 @@
#include "rtbm.h"
#include "radix.h"
#include "svtm.h"
+#include "integerset2.h"
//#define DEBUG_DUMP_MATCHED 1
@@ -93,6 +94,7 @@ PG_FUNCTION_INFO_V1(bench);
PG_FUNCTION_INFO_V1(test_generate_tid);
PG_FUNCTION_INFO_V1(rtbm_test);
PG_FUNCTION_INFO_V1(radix_run_tests);
+PG_FUNCTION_INFO_V1(intset2_run_tests);
PG_FUNCTION_INFO_V1(prepare);
/*
@@ -159,6 +161,14 @@ static bool svtm_reaped(LVTestType *lvtt, ItemPointer itemptr);
static Size svtm_mem_usage(LVTestType *lvtt);
static void svtm_load(SVTm *tbm, ItemPointerData *itemptrs, int nitems);
+/* intset2 */
+static void intset2_init(LVTestType *lvtt, uint64 nitems);
+static void intset2_fini(LVTestType *lvtt);
+static void intset2_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
+ BlockNumber maxblk, OffsetNumber maxoff);
+static bool intset2_reaped(LVTestType *lvtt, ItemPointer itemptr);
+static Size intset2_mem_usage(LVTestType *lvtt);
+
/* Misc functions */
static void generate_index_tuples(uint64 nitems, BlockNumber minblk,
@@ -185,7 +195,7 @@ static void load_rtbm(RTbm *vtbm, ItemPointerData *itemptrs, int nitems);
.mem_usage_fn = n##_mem_usage, \
}
-#define TEST_SUBJECT_TYPES 7
+#define TEST_SUBJECT_TYPES 8
static LVTestType LVTestSubjects[TEST_SUBJECT_TYPES] =
{
DECLARE_SUBJECT(array),
@@ -194,7 +204,8 @@ static LVTestType LVTestSubjects[TEST_SUBJECT_TYPES] =
DECLARE_SUBJECT(vtbm),
DECLARE_SUBJECT(rtbm),
DECLARE_SUBJECT(radix),
- DECLARE_SUBJECT(svtm)
+ DECLARE_SUBJECT(svtm),
+ DECLARE_SUBJECT(intset2)
};
static bool
@@ -843,6 +854,69 @@ svtm_load(SVTm *svtm, ItemPointerData *itemptrs, int nitems)
svtm_finalize_addition(svtm);
}
+/* ------------ intset2 ----------- */
+static void
+intset2_init(LVTestType *lvtt, uint64 nitems)
+{
+ MemoryContext old_ctx;
+
+ lvtt->mcxt = AllocSetContextCreate(TopMemoryContext,
+ "intset2 bench",
+ ALLOCSET_DEFAULT_SIZES);
+ old_ctx = MemoryContextSwitchTo(lvtt->mcxt);
+ lvtt->private = intset2_create();
+ MemoryContextSwitchTo(old_ctx);
+}
+
+static void
+intset2_fini(LVTestType *lvtt)
+{
+ if (lvtt->private != NULL)
+ intset2_free(lvtt->private);
+}
+
+static inline uint64
+intset2_encode(ItemPointer tid)
+{
+ uint64 tid_i;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+
+ Assert(ItemPointerGetOffsetNumber(tid)>0);
+ tid_i = ItemPointerGetOffsetNumber(tid) - 1;
+ tid_i |= (uint64)ItemPointerGetBlockNumber(tid) << shift;
+
+ return tid_i;
+}
+
+static void
+intset2_attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk,
+ BlockNumber maxblk, OffsetNumber maxoff)
+{
+ uint64 i;
+ MemoryContext oldcontext = MemoryContextSwitchTo(lvtt->mcxt);
+
+ for (i = 0; i < nitems; i++)
+ {
+ intset2_add_member(lvtt->private,
+ intset2_encode(DeadTuples_orig->itemptrs + i));
+ }
+
+ MemoryContextSwitchTo(oldcontext);
+}
+
+static bool
+intset2_reaped(LVTestType *lvtt, ItemPointer itemptr)
+{
+ return intset2_is_member(lvtt->private, intset2_encode(itemptr));
+}
+
+static uint64
+intset2_mem_usage(LVTestType *lvtt)
+{
+ //svtm_stats((SVTm *) lvtt->private);
+ return MemoryContextMemAllocated(lvtt->mcxt, true);
+}
+
static void
attach(LVTestType *lvtt, uint64 nitems, BlockNumber minblk, BlockNumber maxblk,
@@ -1229,3 +1303,11 @@ radix_run_tests(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+
+Datum
+intset2_run_tests(PG_FUNCTION_ARGS)
+{
+ intset2_test_1();
+
+ PG_RETURN_VOID();
+}
diff --git a/bdbench/bench.sql b/bdbench/bench.sql
index b303591..01ee846 100644
--- a/bdbench/bench.sql
+++ b/bdbench/bench.sql
@@ -17,6 +17,7 @@ select 'tbm', attach_dead_tuples('tbm');
select 'vtbm', attach_dead_tuples('vtbm');
select 'radix', attach_dead_tuples('radix');
select 'svtm', attach_dead_tuples('svtm');
+select 'intset2', attach_dead_tuples('intset2');
-- Do benchmark of lazy_tid_reaped.
select 'array bench', bench('array');
@@ -26,6 +27,7 @@ select 'tbm bench', bench('tbm');
select 'vtbm bench', bench('vtbm');
select 'radix', bench('radix');
select 'svtm', bench('svtm');
+select 'intset2', bench('intset2');
-- Check the memory usage.
select * from pg_backend_memory_contexts where name ~ 'bench' or name = 'TopMemoryContext' order by name;
diff --git a/bdbench/integerset2.c b/bdbench/integerset2.c
new file mode 100644
index 0000000..441e224
--- /dev/null
+++ b/bdbench/integerset2.c
@@ -0,0 +1,887 @@
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "lib/stringinfo.h"
+#include "port/pg_bitutils.h"
+#include "nodes/bitmapset.h"
+
+#include "integerset2.h"
+
+#define ONE ((bitmapword)1)
+#if BITS_PER_BITMAPWORD == 64
+#define SHIFT 6
+#define pg_popcountW pg_popcount64
+#else
+#define SHIFT 5
+#define pg_popcountW pg_popcount32
+#endif
+#define START(x) ((x) >> SHIFT)
+#define STARTN(x, n) ((x) >> (SHIFT*(n)))
+#define NBIT(x) ((x) & (((uint64)1 << SHIFT)-1))
+#define BIT(x) (ONE << NBIT(x))
+#define NBITN(x, n) NBIT(STARTN(x, (n)-1))
+#define BITN(x, n) (ONE << NBITN((x), (n)))
+
+/*
+ * Compressed leaf bitmap is indexed with 2 level bitmap index with
+ * 1 byte in root level. Therefore there is 8 bytes in second level
+ * and 64 bytes in third level.
+ */
+#define LEAF_SHIFT (3+3+3)
+#define LEAF_BITS (1 << LEAF_SHIFT)
+#define LEAF_BYTES (LEAF_BITS / 8)
+#define LBYTE(x) (((x) / 8) & (LEAF_BYTES-1))
+#define LBIT(x) (1 << ((x) & 7));
+
+#define CHUNK_LEAFS BITS_PER_BITMAPWORD
+#define CHUNK_SHIFT (LEAF_SHIFT + SHIFT)
+#define CSTART(x) ((x) & ~(((uint64)1 << CHUNK_SHIFT)-1))
+#define CPOS(x) NBIT((x) >> LEAF_SHIFT)
+
+#define VAL_TO_PAGE(val) ((val) >> LEAF_SHIFT)
+#define VAL_TO_CHUNK(val) ((val) >> CHUNK_SHIFT)
+#define TRIE_LEVELS (64 / SHIFT)
+
+#define ISAllocBatch (1<<18)
+
+typedef struct IntsetAllocator IntsetAllocator;
+struct IntsetAllocator
+{
+ Size total_size;
+ Size alloc_size;
+ Size pos;
+ Size limit;
+ uint8 *current;
+ List *chunks;
+};
+
+/* TRIE (HAMT like) */
+typedef struct IntsetTrieVal IntsetTrieVal;
+typedef struct IntsetTrieElem IntsetTrieElem;
+typedef void* (*trie_alloc)(Size size, void *arg);
+typedef struct IntsetTrie IntsetTrie;
+
+struct IntsetTrieElem
+{
+ uint64 key;
+ bitmapword bitmap;
+ union
+ {
+ void *val;
+ IntsetTrieElem *children;
+ } p;
+};
+
+struct IntsetTrie
+{
+ trie_alloc alloc;
+ void *alloc_arg;
+
+ int root_level;
+ IntsetTrieElem root;
+ uint32 n[TRIE_LEVELS - 1];
+ IntsetTrieElem l[TRIE_LEVELS - 1][BITS_PER_BITMAPWORD];
+};
+
+struct IntsetTrieVal
+{
+ bitmapword bitmap;
+ void *val;
+};
+
+/* Intset */
+
+typedef enum IntsetLeafType IntsetLeafType;
+typedef struct IntsetLeafBitmap IntsetLeafBitmap;
+typedef struct IntsetLeafEmbed IntsetLeafEmbed;
+typedef union IntsetLeafHeader IntsetLeafHeader;
+/* alias for pointer */
+typedef IntsetLeafHeader IntsetChunk;
+typedef struct IntsetLeafBuilder IntsetLeafBuilder;
+typedef struct IntsetChunkBuilder IntsetChunkBuilder;
+
+#define bm2(b,c) (((b)<<1)|(c))
+enum IntsetLeafType {
+ LT_RAW = bm2(0, 0),
+ LT_INVERSE = bm2(0, 1),
+ LT_SPARSE = bm2(1, 0),
+ LT_EMBED = bm2(1, 1),
+};
+
+struct IntsetLeafBitmap
+{
+ IntsetLeafType type:2;
+ uint32 minbyte:6;
+ uint32 maxbyte:6;
+ uint32 offset:16;
+};
+
+struct IntsetLeafEmbed
+{
+ IntsetLeafType type:2;
+ uint32 v0:9;
+ uint32 v1:9;
+ uint32 v2:9;
+};
+
+union IntsetLeafHeader
+{
+ IntsetLeafBitmap b;
+ IntsetLeafEmbed e;
+ uint32 v;
+};
+
+StaticAssertDecl(sizeof(IntsetLeafBitmap) == sizeof(IntsetLeafEmbed),
+ "incompatible bit field packing");
+StaticAssertDecl(sizeof(IntsetLeafBitmap) == sizeof(uint32),
+ "incompatible bit field packing");
+
+
+struct IntsetLeafBuilder
+{
+ uint16 nvals;
+ uint16 embed[3];
+ uint8 minbyte;
+ uint8 maxbyte;
+ uint8 bytes[LEAF_BYTES];
+};
+
+struct IntsetChunkBuilder
+{
+ uint64 chunk;
+ bitmapword bitmap;
+ IntsetLeafBuilder leafs[CHUNK_LEAFS];
+};
+
+struct IntegerSet2
+{
+ uint64 firstvalue;
+ uint64 nvalues;
+
+ IntsetAllocator alloc;
+
+ IntsetChunkBuilder current;
+ IntsetTrie trie;
+};
+
+
+/* Allocator functions */
+
+static void *intset2_alloc(Size size, IntsetAllocator *alloc);
+static void intset2_alloc_free(IntsetAllocator *alloc);
+
+/* Trie functions */
+
+static inline void intset2_trie_init(IntsetTrie *trie,
+ trie_alloc alloc,
+ void* arg);
+static void intset2_trie_insert(IntsetTrie *trie,
+ uint64 key,
+ IntsetTrieVal val);
+static IntsetTrieVal intset2_trie_lookup(IntsetTrie *trie, uint64 key);
+
+/* Intset functions */
+
+static uint8 intset2_leafbuilder_add(IntsetLeafBuilder *leaf, uint64 v);
+static inline bool intset2_leafbuilder_is_member(IntsetLeafBuilder *leaf,
+ uint64 v);
+static uint8 intset2_chunkbuilder_add(IntsetChunkBuilder *chunk, uint64 v);
+static bool intset2_chunkbuilder_is_member(IntsetChunkBuilder *chunk,
+ uint64 v);
+static bool intset2_chunk_is_member(IntsetChunk *chunk,
+ bitmapword bitmap,
+ uint64 v);
+
+static void intset2_compress_current(IntegerSet2 *intset);
+
+static inline uint8 pg_popcount8(uint8 b);
+static inline uint8 pg_popcount8_lowbits(uint8 b, uint8 nbits);
+static inline uint8 pg_popcount_small(uint8 *b, uint8 len);
+static inline uint32 intset2_compact(uint8 *dest, uint8 *src, uint8 len, bool inverse);
+
+/* Allocator */
+
+static void*
+intset2_alloc(Size size, IntsetAllocator *alloc)
+{
+ Assert(size < ISAllocBatch);
+
+ size = MAXALIGN(size);
+
+ if (alloc->limit - alloc->pos < size)
+ {
+ alloc->current = palloc0(ISAllocBatch);
+ alloc->chunks = lappend(alloc->chunks, alloc->current);
+ alloc->pos = 0;
+ alloc->limit = ISAllocBatch;
+ alloc->total_size += ISAllocBatch;
+ }
+
+ alloc->pos += size;
+ alloc->alloc_size += size;
+ return alloc->current + (alloc->pos - size);
+}
+
+static void
+intset2_alloc_free(IntsetAllocator *alloc)
+{
+ list_free_deep(alloc->chunks);
+}
+
+/* Trie */
+
+static inline void
+intset2_trie_init(IntsetTrie *trie, trie_alloc alloc, void* arg)
+{
+ memset(trie, 0, sizeof(*trie));
+ trie->root_level = -1;
+ trie->alloc = alloc;
+ trie->alloc_arg = arg;
+}
+
+static void
+intset2_trie_insert(IntsetTrie *trie, uint64 key, IntsetTrieVal val)
+{
+ IntsetTrieElem *root = &trie->root;
+ IntsetTrieElem *chunk;
+ IntsetTrieElem *parent;
+ IntsetTrieElem insert;
+ int level = trie->root_level;
+
+ if (level == -1)
+ {
+ trie->root_level = 0;
+ root->key = key;
+ root->bitmap = val.bitmap;
+ root->p.val = val.val;
+ return;
+ }
+
+ Assert(root->key <= STARTN(key, level));
+ Assert(trie->root_level != 0 || root->key < key);
+
+ /* Adjust root level */
+ while (root->key != STARTN(key, level))
+ {
+ trie->l[level][0] = *root;
+ trie->n[level] = 1;
+ root->p.children = trie->l[level];
+ root->bitmap = BIT(root->key);
+ root->key >>= SHIFT;
+ level++;
+ }
+ trie->root_level = level;
+
+ /* Actual insert */
+ insert.key = key;
+ insert.bitmap = val.bitmap;
+ insert.p.val = val.val;
+
+ /*
+ * Iterate while we need to move current level to alloced
+ * space.
+ *
+ * Since we've fixed root in the loop above, we certainly
+ * will quit.
+ */
+ for (level = 0;; level++) {
+ IntsetTrieElem *alloced;
+ uint32 n = trie->n[level];
+ Size asize;
+
+ chunk = trie->l[level];
+ Assert(chunk[n-1].key <= insert.key);
+
+ if (level < trie->root_level-1)
+ parent = &trie->l[level+1][trie->n[level+1]-1];
+ else
+ parent = root;
+
+ Assert(pg_popcountW(parent->bitmap) == n);
+
+ if (parent->key == START(insert.key))
+ /* Yes, we are in the same chunk */
+ break;
+
+ /*
+ * We are not in the same chunk. We need to move
+ * layer to allocated space and start new one.
+ */
+ asize = n * sizeof(IntsetTrieElem);
+ alloced = trie->alloc(asize, trie->alloc_arg);
+ memmove(alloced, chunk, asize);
+ parent->p.children = alloced;
+
+ /* insert into this level */
+ memset(chunk, 0, sizeof(*chunk) * BITS_PER_BITMAPWORD);
+ chunk[0] = insert;
+ trie->n[level] = 1;
+
+ /* prepare insertion into upper level */
+ insert.bitmap = BIT(insert.key);
+ insert.p.children = chunk;
+ insert.key >>= SHIFT;
+ }
+
+ Assert((parent->bitmap & BIT(insert.key)) == 0);
+
+ parent->bitmap |= BIT(insert.key);
+ chunk[trie->n[level]] = insert;
+ trie->n[level]++;
+
+ Assert(pg_popcountW(parent->bitmap) == trie->n[level]);
+}
+
+static IntsetTrieVal
+intset2_trie_lookup(IntsetTrie *trie, uint64 key)
+{
+ IntsetTrieVal result = {0, NULL};
+ IntsetTrieElem *current = &trie->root;
+ int level = trie->root_level;
+
+ if (level == -1)
+ return result;
+
+ /* root is out of bound */
+ if (current->key != STARTN(key, level))
+ return result;
+
+ for (; level > 0; level--)
+ {
+ int n;
+ uint64 bit = BITN(key, level);
+
+ if ((current->bitmap & bit) == 0)
+ /* Not found */
+ return result;
+ n = pg_popcountW(current->bitmap & (bit-1));
+ current = &current->p.children[n];
+ }
+
+ Assert(current->key == key);
+
+ result.bitmap = current->bitmap;
+ result.val = current->p.val;
+
+ return result;
+}
+
+/* Intset */
+
+/* returns 1 if new element were added, 0 otherwise */
+static uint8
+intset2_leafbuilder_add(IntsetLeafBuilder *leaf, uint64 v)
+{
+ uint16 bv;
+ uint8 lbyte, lbit, missing;
+
+ bv = v % LEAF_BITS;
+ lbyte = LBYTE(bv);
+ lbit = LBIT(bv);
+
+ if (leaf->nvals < 3)
+ leaf->embed[leaf->nvals] = bv;
+ if (leaf->nvals == 0)
+ leaf->minbyte = leaf->maxbyte = lbyte;
+ else
+ {
+ Assert(lbyte >= leaf->maxbyte);
+ leaf->maxbyte = lbyte;
+ }
+
+ lbyte -= leaf->minbyte;
+
+ missing = (leaf->bytes[lbyte] & lbit) == 0;
+ leaf->bytes[lbyte] |= lbit;
+ leaf->nvals += missing;
+ return missing;
+}
+
+static inline bool
+intset2_leafbuilder_is_member(IntsetLeafBuilder *leaf, uint64 v)
+{
+ uint16 bv;
+ uint8 lbyte, lbit;
+
+ bv = v % LEAF_BITS;
+ lbyte = LBYTE(bv);
+ lbit = LBIT(bv);
+
+ /* we shouldn't be here unless we set something */
+ Assert(leaf->nvals != 0);
+
+ if (lbyte < leaf->minbyte || lbyte > leaf->maxbyte)
+ return false;
+ lbyte -= leaf->minbyte;
+ return (leaf->bytes[lbyte] & lbit) != 0;
+}
+
+static uint8
+intset2_chunkbuilder_add(IntsetChunkBuilder *chunk, uint64 v)
+{
+ IntsetLeafBuilder *leafs = chunk->leafs;
+
+ Assert(CSTART(v) == chunk->chunk);
+ chunk->bitmap |= (bitmapword)1<<CPOS(v);
+ return intset2_leafbuilder_add(&leafs[CPOS(v)], v);
+}
+
+static bool
+intset2_chunkbuilder_is_member(IntsetChunkBuilder *chunk, uint64 v)
+{
+ IntsetLeafBuilder *leafs = chunk->leafs;
+
+ Assert(CSTART(v) == chunk->chunk);
+ if ((chunk->bitmap & ((bitmapword)1<<CPOS(v))) == 0)
+ return false;
+ return intset2_leafbuilder_is_member(&leafs[CPOS(v)], v);
+}
+
+static bool
+intset2_chunk_is_member(IntsetChunk *chunk, bitmapword bitmap, uint64 v)
+{
+ IntsetLeafHeader h;
+
+ uint32 cpos;
+ bitmapword cbit;
+ uint8 *buf;
+ uint32 bv;
+ uint8 root;
+ uint8 lbyte;
+ uint8 l1bm;
+ uint8 l1len;
+ uint8 l1pos;
+ uint8 lbit;
+ bool found;
+ bool inverse;
+
+ cpos = CPOS(v);
+ cbit = ONE << cpos;
+
+ if ((bitmap & cbit) == 0)
+ return false;
+ h = chunk[pg_popcountW(bitmap & (cbit-1))];
+
+ bv = v % LEAF_BITS;
+ if (h.e.type == LT_EMBED)
+ return bv == h.e.v0 || bv == h.e.v1 || bv == h.e.v2;
+
+ lbyte = LBYTE(bv);
+ lbit = LBIT(bv);
+ buf = (uint8*)(chunk + pg_popcountW(bitmap)) + h.b.offset;
+
+ if (lbyte < h.b.minbyte || lbyte > h.b.maxbyte)
+ return false;
+ lbyte -= h.b.minbyte;
+
+ if (h.b.type == LT_RAW)
+ return (buf[lbyte] & lbit) != 0;
+
+ inverse = h.b.type == LT_INVERSE;
+
+ /*
+ * Bitmap is sparse, so we have to recalculate lbyte.
+ * lbyte = popcount(bits in level1 up to lbyte)
+ */
+ root = buf[0];
+ if ((root & (1<<(lbyte/8))) == 0)
+ return inverse;
+
+ /* Calculate position in sparse level1 index. */
+ l1pos = pg_popcount8_lowbits(root, lbyte/8);
+ l1bm = buf[1+l1pos];
+ if ((l1bm & (1<<(lbyte&7))) == 0)
+ return inverse;
+ /* Now we have to check bitmap byte itself */
+ /* Calculate length of sparse level1 index */
+ l1len = pg_popcount8(root);
+ /*
+ * Corrected lbyte position is count of bits set in the level1 upto
+ * our original position.
+ */
+ lbyte = pg_popcount_small(buf+1, l1pos) +
+ pg_popcount8_lowbits(l1bm, lbyte&7);
+ found = (buf[1+l1len+lbyte] & lbit) != 0;
+ return found != inverse;
+}
+
+IntegerSet2*
+intset2_create(void)
+{
+ IntegerSet2 *intset = palloc0(sizeof(IntegerSet2));
+
+ intset2_trie_init(&intset->trie,
+ (trie_alloc)intset2_alloc,
+ &intset->alloc);
+
+ return intset;
+}
+
+void
+intset2_free(IntegerSet2 *intset)
+{
+ intset2_alloc_free(&intset->alloc);
+ pfree(intset);
+}
+
+void
+intset2_add_member(IntegerSet2 *intset, uint64 v)
+{
+ uint64 cstart;
+ if (intset->nvalues == 0)
+ {
+ uint8 add;
+
+ intset->firstvalue = CSTART(v);
+ v -= intset->firstvalue;
+ add = intset2_chunkbuilder_add(&intset->current, v);
+ Assert(add == 1);
+ intset->nvalues += add;
+ return;
+ }
+
+ v -= intset->firstvalue;
+ cstart = CSTART(v);
+ Assert(cstart >= intset->current.chunk);
+ if (cstart != intset->current.chunk)
+ {
+ intset2_compress_current(intset);
+ intset->current.chunk = cstart;
+ }
+
+ intset->nvalues += intset2_chunkbuilder_add(&intset->current, v);
+}
+
+bool
+intset2_is_member(IntegerSet2 *intset, uint64 v)
+{
+ IntsetTrieVal trieval;
+
+ if (intset->nvalues == 0)
+ return false;
+
+ if (v < intset->firstvalue)
+ return false;
+
+ v -= intset->firstvalue;
+
+ if (intset->current.chunk < CSTART(v))
+ return false;
+
+ if (intset->current.chunk == CSTART(v))
+ return intset2_chunkbuilder_is_member(&intset->current, v);
+
+ trieval = intset2_trie_lookup(&intset->trie, v>>CHUNK_SHIFT);
+ return intset2_chunk_is_member(trieval.val, trieval.bitmap, v);
+}
+
+uint64
+intset2_num_entries(IntegerSet2 *intset)
+{
+ return intset->nvalues;
+}
+
+uint64
+intset2_memory_usage(IntegerSet2 *intset)
+{
+ /* we are missing alloc->chunks here */
+ return sizeof(IntegerSet2) + intset->alloc.total_size;
+}
+
+static void
+intset2_compress_current(IntegerSet2 *intset)
+{
+ IntsetChunkBuilder *bld = &intset->current;
+ IntsetLeafBuilder *leaf;
+ uint32 nheaders = 0;
+ IntsetLeafHeader headers[BITS_PER_BITMAPWORD];
+ IntsetLeafHeader h = {.v = 0};
+ IntsetTrieVal trieval = {0, NULL};
+ uint64 triekey;
+ uint32 hlen, totallen;
+ uint32 bufpos = 0;
+ uint32 i;
+ uint8 buffer[BITS_PER_BITMAPWORD * LEAF_BYTES];
+
+ for (i = 0; i < BITS_PER_BITMAPWORD; i++)
+ {
+ if ((bld->bitmap & (ONE<<i)) == 0)
+ continue;
+
+ leaf = &bld->leafs[i];
+ Assert(leaf->nvals != 0);
+
+ if (leaf->nvals < 3)
+ {
+ h.e.type = LT_EMBED;
+ /*
+ * Header elements should be all filled because we doesn't store
+ * their amount;
+ * do the trick to fill possibly empty place
+ * n = 1 => n/2 = 0, n-1 = 0
+ * n = 2 => n/2 = 1, n-1 = 1
+ * n = 3 => n/2 = 1, n-1 = 2
+ */
+ h.e.v0 = leaf->embed[0];
+ h.e.v1 = leaf->embed[leaf->nvals/2];
+ h.e.v2 = leaf->embed[leaf->nvals-1];
+ }
+ else
+ {
+ /* root raw and root inverse */
+ uint8 rraw = 0,
+ rinv = 0;
+ /* level 1 index raw and index inverse */
+ uint8 raw[LEAF_BYTES/8] = {0},
+ inv[LEAF_BYTES/8] = {0};
+ /* zero count for raw map and inverse map */
+ uint8 cnt_00 = 0,
+ cnt_ff = 0;
+ uint8 mlen, llen;
+ uint8 splen, invlen, threshold;
+ uint8 b00, bff;
+ uint8 *buf;
+ int j;
+
+ h.b.minbyte = leaf->minbyte;
+ h.b.maxbyte = leaf->maxbyte;
+ h.b.offset = bufpos;
+
+ mlen = leaf->maxbyte+1 - leaf->minbyte;
+ for (j = 0; j < mlen; j++)
+ {
+ b00 = leaf->bytes[j] == 0;
+ bff = leaf->bytes[j] == 0xff;
+ cnt_00 += b00;
+ cnt_ff += bff;
+ raw[j/8] |= (1-b00) << (j&7);
+ inv[j/8] |= (1-bff) << (j&7);
+ Assert(j/64 == 0);
+ rraw |= (1-b00) << ((j/8)&7);
+ rinv |= (1-bff) << ((j/8)&7);
+ }
+
+ llen = (mlen-1)/8+1;
+ for (j = 0; j < llen; j++)
+ {
+ cnt_00 += raw[j] == 0;
+ cnt_ff += inv[j] == 0;
+ }
+
+ buf = buffer + bufpos;
+
+ splen = mlen + llen + 1 - cnt_00;
+ invlen = mlen + llen + 1 - cnt_ff;
+ threshold = mlen <= 4 ? 0 : /* don't compress */
+ mlen <= 8 ? mlen - 2 :
+ mlen * 3 / 4;
+
+ /* sparse map compresses well */
+ if (splen <= threshold && splen <= invlen)
+ {
+ h.b.type = LT_SPARSE;
+ *buf++ = rraw;
+ buf += intset2_compact(buf, raw, llen, false);
+ buf += intset2_compact(buf, leaf->bytes, mlen, false);
+ }
+ /* inverse sparse map compresses well */
+ else if (invlen <= threshold)
+ {
+ h.b.type = LT_INVERSE;
+ *buf++ = rinv;
+ buf += intset2_compact(buf, inv, llen, false);
+ buf += intset2_compact(buf, leaf->bytes, mlen, true);
+ }
+ /* fallback to raw type */
+ else
+ {
+ h.b.type = LT_RAW;
+ memmove(buf, leaf->bytes, mlen);
+ buf += mlen;
+ }
+
+ bufpos = buf - buffer;
+ }
+ headers[nheaders] = h;
+ nheaders++;
+ }
+
+ hlen = nheaders * sizeof(h);
+ totallen = hlen + bufpos;
+
+ trieval.bitmap = bld->bitmap;
+ trieval.val = intset2_alloc(totallen, &intset->alloc);
+ memmove(trieval.val, headers, hlen);
+ memmove((char*)trieval.val + hlen, buffer, bufpos);
+
+ triekey = bld->chunk >> CHUNK_SHIFT;
+ intset2_trie_insert(&intset->trie, triekey, trieval);
+
+ memset(&intset->current, 0, sizeof(intset->current));
+}
+
+#define EXPECT_TRUE(expr) \
+ do { \
+ Assert(expr); \
+ if (!(expr)) \
+ elog(ERROR, \
+ "%s was unexpectedly false in file \"%s\" line %u", \
+ #expr, __FILE__, __LINE__); \
+ } while (0)
+
+#define EXPECT_FALSE(expr) \
+ do { \
+ Assert(!(expr)); \
+ if (expr) \
+ elog(ERROR, \
+ "%s was unexpectedly true in file \"%s\" line %u", \
+ #expr, __FILE__, __LINE__); \
+ } while (0)
+
+#define EXPECT_EQ_U32(result_expr, expected_expr) \
+ do { \
+ uint32 result = (result_expr); \
+ uint32 expected = (expected_expr); \
+ Assert(result == expected); \
+ if (result != expected) \
+ elog(ERROR, \
+ "%s yielded %u, expected %s in file \"%s\" line %u", \
+ #result_expr, result, #expected_expr, __FILE__, __LINE__); \
+ } while (0)
+
+static void
+intset2_test_1_off(uint64 off)
+{
+ IntegerSet2 *intset;
+ uint64 i, d, v;
+
+ intset = intset2_create();
+
+#define K 799
+
+ for (i = 0, d = 1; d < (ONE << (CHUNK_SHIFT + SHIFT + 1)); i+=(d=1+i/K))
+ {
+ v = i + off;
+ EXPECT_FALSE(intset2_is_member(intset, v));
+ EXPECT_FALSE(intset2_is_member(intset, v+1));
+ if (i != 0)
+ {
+ EXPECT_TRUE(intset2_is_member(intset, v-d));
+ }
+ if (d > 1)
+ {
+ EXPECT_FALSE(intset2_is_member(intset, v-1));
+ EXPECT_FALSE(intset2_is_member(intset, v-(d-1)));
+ }
+ intset2_add_member(intset, v);
+ EXPECT_TRUE(intset2_is_member(intset, v));
+ if (i != 0)
+ {
+ EXPECT_TRUE(intset2_is_member(intset, v-d));
+ }
+ if (d > 1)
+ {
+ EXPECT_FALSE(intset2_is_member(intset, v-1));
+ EXPECT_FALSE(intset2_is_member(intset, v-(d-1)));
+ }
+ EXPECT_FALSE(intset2_is_member(intset, v+1));
+ }
+
+ for (i = 0, d = 0; d < (1 << (CHUNK_SHIFT + SHIFT + 1)); i+=(d=1+i/K))
+ {
+ v = i + off;
+
+ EXPECT_TRUE(intset2_is_member(intset, v));
+ if (d != 0)
+ {
+ EXPECT_TRUE(intset2_is_member(intset, v-d));
+ }
+ if (d > 1)
+ {
+ EXPECT_FALSE(intset2_is_member(intset, v+1));
+ EXPECT_FALSE(intset2_is_member(intset, v-1));
+ EXPECT_FALSE(intset2_is_member(intset, v-(d-1)));
+ }
+ }
+
+ intset2_free(intset);
+}
+
+void
+intset2_test_1(void)
+{
+ intset2_test_1_off(0);
+ intset2_test_1_off(1001);
+ intset2_test_1_off(10000001);
+ intset2_test_1_off(100000000001);
+}
+
+/* Tools */
+
+static inline uint32
+intset2_compact(uint8 *dest, uint8 *src, uint8 len, bool inverse)
+{
+ uint32 i, j;
+ uint8 b;
+
+ for (i = j = 0; i < len; i++)
+ {
+ b = inverse ? ~src[i] : src[i];
+ dest[j] = b;
+ j += b != 0;
+ }
+
+ return j;
+}
+
+static const uint8 popcnt[256] = {
+ 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
+ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
+ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
+ 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
+};
+
+static inline uint8
+pg_popcount8(uint8 b)
+{
+ return popcnt[b];
+}
+
+static inline uint8
+pg_popcount8_lowbits(uint8 b, uint8 nbits)
+{
+ Assert(nbits < 8);
+ return popcnt[b&((1<<nbits)-1)];
+}
+
+static inline uint8
+pg_popcount_small(uint8 *b, uint8 len)
+{
+ uint8 r = 0;
+ switch (len&7)
+ {
+ case 7: r += popcnt[b[6]]; /* fallthrough */
+ case 6: r += popcnt[b[5]]; /* fallthrough */
+ case 5: r += popcnt[b[4]]; /* fallthrough */
+ case 4: r += popcnt[b[3]]; /* fallthrough */
+ case 3: r += popcnt[b[2]]; /* fallthrough */
+ case 2: r += popcnt[b[1]]; /* fallthrough */
+ case 1: r += popcnt[b[0]]; /* fallthrough */
+ }
+ return r;
+}
+
diff --git a/bdbench/integerset2.h b/bdbench/integerset2.h
new file mode 100644
index 0000000..b987605
--- /dev/null
+++ b/bdbench/integerset2.h
@@ -0,0 +1,15 @@
+#ifndef INTEGERSET2_H
+#define INTEGERSET2_H
+
+typedef struct IntegerSet2 IntegerSet2;
+
+extern IntegerSet2 *intset2_create(void);
+extern void intset2_free(IntegerSet2 *intset);
+extern void intset2_add_member(IntegerSet2 *intset, uint64 x);
+extern bool intset2_is_member(IntegerSet2 *intset, uint64 x);
+
+extern uint64 intset2_num_entries(IntegerSet2 *intset);
+extern uint64 intset2_memory_usage(IntegerSet2 *intset);
+
+extern void intset2_test_1(void);
+#endif /* INTEGERSET2_H */
--
2.32.0
Yura Sokolov wrote 2021-07-29 18:29:
I've attached IntegerSet2 patch for pgtools repo and benchmark results.
Branch https://github.com/funny-falcon/pgtools/tree/integerset2
Strange web-mail client... I can never be sure what it will attach...
Reattaching the benchmark results.
regards,
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
Attachments:
On Thu, Jul 29, 2021 at 5:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Indeed. Given that the radix tree itself has other use cases, I have
no concern about using radix tree for vacuum's dead tuples storage. It
will be better to have one that can be generally used and has some
optimizations that are helpful also for vacuum's use case, rather than
having one that is very optimized only for vacuum's use case.
What I'm about to say might be a really stupid idea, especially since
I haven't looked at any of the code already posted, but what I'm
wondering about is whether we need a full radix tree or maybe just a
radix-like lookup aid. For example, suppose that for a relation <= 8MB
in size, we create an array of 1024 elements indexed by block number.
Each element of the array stores an offset into the dead TID array.
When you need to probe for a TID, you look up blkno and blkno + 1 in
the array and then bsearch only between those two offsets. For bigger
relations, a two or three level structure could be built, or it could
always be 3 levels. This could even be done on demand, so you
initialize all of the elements to some special value that means "not
computed yet" and then fill them the first time they're needed,
perhaps with another special value that means "no TIDs in that block".
I don't know if this is better, but I do kind of like the fact that
the basic representation is just an array. It makes it really easy to
predict how much memory will be needed for a given number of dead
TIDs, and it's very DSM-friendly as well.
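A minimal sketch of that lookup aid under the stated assumptions (all names
are hypothetical, not from any posted patch): block_start[] maps a block
number to the index of its first entry in the sorted dead-TID array, so a
probe only bsearches within that block's slice.

static bool
tid_is_dead(ItemPointer tid,
			ItemPointerData *dead_tids,	/* sorted dead TID array */
			uint32 *block_start,		/* per-block start index, length nblocks + 1 */
			BlockNumber nblocks)
{
	BlockNumber blkno = ItemPointerGetBlockNumber(tid);
	uint32		lo,
				hi;

	if (blkno >= nblocks)
		return false;

	lo = block_start[blkno];
	hi = block_start[blkno + 1];

	/* binary search, but only within this block's slice of the array */
	while (lo < hi)
	{
		uint32		mid = lo + (hi - lo) / 2;
		int			cmp = ItemPointerCompare(tid, &dead_tids[mid]);

		if (cmp == 0)
			return true;
		else if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return false;
}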
--
Robert Haas
EDB: http://www.enterprisedb.com
Robert Haas wrote 2021-07-29 20:15:
On Thu, Jul 29, 2021 at 5:11 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

Indeed. Given that the radix tree itself has other use cases, I have
no concern about using radix tree for vacuum's dead tuples storage. It
will be better to have one that can be generally used and has some
optimizations that are helpful also for vacuum's use case, rather than
having one that is very optimized only for vacuum's use case.

What I'm about to say might be a really stupid idea, especially since
I haven't looked at any of the code already posted, but what I'm
wondering about is whether we need a full radix tree or maybe just a
radix-like lookup aid. For example, suppose that for a relation <= 8MB
in size, we create an array of 1024 elements indexed by block number.
Each element of the array stores an offset into the dead TID array.
When you need to probe for a TID, you look up blkno and blkno + 1 in
the array and then bsearch only between those two offsets. For bigger
relations, a two or three level structure could be built, or it could
always be 3 levels. This could even be done on demand, so you
initialize all of the elements to some special value that means "not
computed yet" and then fill them the first time they're needed,
perhaps with another special value that means "no TIDs in that block".
An 8MB relation is not a problem, imo. There is no need to do anything to
handle an 8MB relation.

The problem is a 2TB relation. It has 256M pages and, let's suppose, 3G dead
tuples. Then the offset array will be 2GB and the tuple offset array will be
6GB (2-byte offset per tuple), for 8GB in total.

We could build the offset array only for the higher 3 bytes of the block
number. We would then have a 1M-entry offset array weighing 8MB, and an
array of 3-byte tuple pointers (1 remaining byte from the block number, and
2 bytes from the tuple) weighing 9GB.

But using per-batch compression schemes, it could be amortized to 4 bytes
per page and 1 byte per tuple: 1GB + 3GB = 4GB of memory.
Yes, it is not as guaranteed as in the array approach. But 95% of the time
it is that low or even lower. And better: the more tuples are dead, the
better the compression works. A page with all tuples dead could be encoded
in as little as 5 bytes. Therefore, overall memory consumption is more
stable and predictable.

Lower memory consumption of the tuple storage means there is less chance
that indexes have to be scanned twice or more. That gives more
predictability in the user experience.
I don't know if this is better, but I do kind of like the fact that
the basic representation is just an array. It makes it really easy to
predict how much memory will be needed for a given number of dead
TIDs, and it's very DSM-friendly as well.
The whole thing could be encoded in one single array of bytes. Just give
"pointer-to-array"+"array-size" to the constructor, and use a "bump
allocator" inside. A complex logical structure doesn't imply
"DSM-unfriendliness". Hmm.... I mean, if it is suitably designed.

In fact, my code uses a bump allocator internally to avoid the
"per-allocation overhead" of "aset", "slab" or "generational". And the
IntegerSet2 version even uses it for all allocations since it has no
reallocatable parts.
Well, if a datastructure has reallocatable parts, it could be less friendly
to DSM.
regards,
---
Yura Sokolov
y.sokolov@postgrespro.ru
funny.falcon@gmail.com
Hi,
On 2021-07-29 13:15:53 -0400, Robert Haas wrote:
I don't know if this is better, but I do kind of like the fact that
the basic representation is just an array. It makes it really easy to
predict how much memory will be needed for a given number of dead
TIDs, and it's very DSM-friendly as well.
I think those advantages are far outstripped by the big disadvantage of
needing to either size the array accurately from the start, or to
reallocate the whole array. Our current pre-allocation behaviour is
very wasteful for most vacuums but doesn't handle large work_mem at all,
causing unnecessary index scans.
Greetings,
Andres Freund
On Thu, Jul 29, 2021 at 3:14 PM Andres Freund <andres@anarazel.de> wrote:
I think those advantages are far outstripped by the big disadvantage of
needing to either size the array accurately from the start, or to
reallocate the whole array. Our current pre-allocation behaviour is
very wasteful for most vacuums but doesn't handle large work_mem at all,
causing unnecessary index scans.
I agree that the current pre-allocation behavior is bad, but I don't
really see that as an issue with my idea. Fixing that would require
allocating the array in chunks, but that doesn't really affect the
core of the idea much, at least as I see it.
But I accept that Yura has a very good point about the memory usage of
what I was proposing.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2021-07-30 15:13:49 -0400, Robert Haas wrote:
On Thu, Jul 29, 2021 at 3:14 PM Andres Freund <andres@anarazel.de> wrote:
I think those advantages are far outstripped by the big disadvantage of
needing to either size the array accurately from the start, or to
reallocate the whole array. Our current pre-allocation behaviour is
very wasteful for most vacuums but doesn't handle large work_mem at all,
causing unnecessary index scans.

I agree that the current pre-allocation behavior is bad, but I don't
really see that as an issue with my idea. Fixing that would require
allocating the array in chunks, but that doesn't really affect the
core of the idea much, at least as I see it.
Well, then it'd not really be the "simple array approach" anymore :)
But I accept that Yura has a very good point about the memory usage of
what I was proposing.
The lower memory usage also often will result in a better cache
utilization - which is a crucial factor for index vacuuming when the
index order isn't correlated with the heap order. Cache misses really
are a crucial performance factor there.
Greetings,
Andres Freund
On Fri, Jul 30, 2021 at 3:34 PM Andres Freund <andres@anarazel.de> wrote:
The lower memory usage also often will result in a better cache
utilization - which is a crucial factor for index vacuuming when the
index order isn't correlated with the heap order. Cache misses really
are a crucial performance factor there.
Fair enough.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
Today I noticed the inefficiencies of our dead tuple storage once
again, and started theorizing about a better storage method; which is
when I remembered that this thread exists, and that this thread
already has amazing results.
Are there any plans to get the results of this thread from PoC to committable?
Kind regards,
Matthias van de Meent
Hi,
On 2022-02-11 13:47:01 +0100, Matthias van de Meent wrote:
Today I noticed the inefficiencies of our dead tuple storage once
again, and started theorizing about a better storage method; which is
when I remembered that this thread exists, and that this thread
already has amazing results.

Are there any plans to get the results of this thread from PoC to committable?

I'm not currently planning to work on it personally. It would be awesome if
somebody did...
Greetings,
Andres Freund
On Sun, Feb 13, 2022 at 11:02 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-02-11 13:47:01 +0100, Matthias van de Meent wrote:
Today I noticed the inefficiencies of our dead tuple storage once
again, and started theorizing about a better storage method; which is
when I remembered that this thread exists, and that this thread
already has amazing results.

Are there any plans to get the results of this thread from PoC to committable?

I'm not currently planning to work on it personally. It would be awesome if
somebody did...
Actually, I'm working on simplifying and improving radix tree
implementation for PG16 dev cycle. From the discussion so far I think
it's better to have a data structure that can be used for
general-purpose and is also good for storing TID, not very specific to
store TID. So I think radix tree would be a potent candidate. I have
done the insertion and search implementation.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On 2022-02-13 12:36:13 +0900, Masahiko Sawada wrote:
Actually, I'm working on simplifying and improving the radix tree
implementation for the PG16 dev cycle. From the discussion so far I think
it's better to have a data structure that can be used for general
purposes and is also good for storing TIDs, rather than one very
specific to storing TIDs. So I think a radix tree would be a strong
candidate. I have done the insertion and search implementation.
Awesome!
Hi,
On Sun, Feb 13, 2022 at 12:39 PM Andres Freund <andres@anarazel.de> wrote:
On 2022-02-13 12:36:13 +0900, Masahiko Sawada wrote:
Actually, I'm working on simplifying and improving the radix tree
implementation for the PG16 dev cycle. From the discussion so far I think
it's better to have a data structure that can be used for general
purposes and is also good for storing TIDs, rather than one very
specific to storing TIDs. So I think a radix tree would be a strong
candidate. I have done the insertion and search implementation.
Awesome!
To move this project forward, I've implemented a radix tree from
scratch while studying Andres's implementation. It supports insertion,
search, and iteration but not deletion yet. In my implementation, I use
Datum as the value so internal and leaf nodes have the same data
structure, simplifying the implementation. Iteration on the radix tree
returns keys with their values in ascending order of the key. The patch
has regression tests for the radix tree but is still in a PoC state: it
still contains a lot of debugging code, doesn't support SSE2 SIMD
instructions, and the -mavx2 flag is hard-coded.
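For reference, here is a minimal, untested sketch of how the API in the
attached patch could be used, based only on the functions declared in
radixtree.h in the patch:

#include "postgres.h"
#include "lib/radixtree.h"

static void
radix_tree_example(void)
{
    radix_tree      *tree;
    radix_tree_iter *iter;
    uint64           key;
    Datum            val;
    bool             found;

    tree = radix_tree_create(CurrentMemoryContext);

    /* Insert key-value pairs; found reports whether the key already existed */
    radix_tree_insert(tree, 42, UInt64GetDatum(4200), &found);
    radix_tree_insert(tree, 7, UInt64GetDatum(700), &found);

    /* Point lookup */
    val = radix_tree_search(tree, 42, &found);
    if (found)
        elog(NOTICE, "value " UINT64_FORMAT, DatumGetUInt64(val));

    /* Iterate in ascending key order */
    iter = radix_tree_begin_iterate(tree);
    while (radix_tree_iterate_next(iter, &key, &val))
        elog(NOTICE, "key " UINT64_FORMAT, key);
    radix_tree_end_iterate(iter);

    radix_tree_destroy(tree);
}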
I've measured the memory size, loading performance, and lookup
performance of each candidate data structure with two test cases, dense and sparse,
using the test tool[1]https://github.com/MasahikoSawada/pgtools/tree/master/bdbench. Here are the results:
* Case1 - Dense (simulating the case where there are 1000 consecutive
pages each of which has 100 dead tuples, at 100 page intervals.)
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within a page
1000, -- # of consecutive pages having dead tuples
1100 -- page interval
);
name          size      attach        lookup
array         520 MB     248.60 ms   89891.92 ms
hash         3188 MB   28029.59 ms   50850.32 ms
intset         85 MB     644.96 ms   39801.17 ms
tbm            96 MB     474.06 ms    6641.38 ms
radix          37 MB     173.03 ms    9145.97 ms
radix_tree     36 MB     184.51 ms    9729.94 ms
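(For scale, the array result implies roughly 91 million dead TIDs in this
case, i.e. 520 MB at 6 bytes per TID, so the radix tree's 37 MB works out
to well under half a byte per TID.)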
* Case2 - Sparse (simulating a case where there are pages that have 2
dead tuples every 1000 pages.)
select prepare(
10000000, -- max block
2, -- # of dead tuples per page
50, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1000 -- page interval
);
name          size     attach      lookup
array         125 kB   0.53 ms   82183.61 ms
hash         1032 kB   1.31 ms   28128.33 ms
intset        222 kB   0.51 ms   87775.68 ms
tbm           768 MB   1.24 ms   98674.60 ms
radix        1080 kB   1.66 ms   20698.07 ms
radix_tree    949 kB   1.50 ms   21465.23 ms
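(For scale, the sparse case has only about 20,000 dead TIDs, matching the
125 kB array at 6 bytes per TID, so per-entry and per-block overhead
dominates here rather than raw encoding density.)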
Each test virtually generates TIDs and loads them into the data
structure, and then searches for virtual index TIDs.
'array' is a sorted array (the current method), 'hash' is HTAB,
'intset' is IntegerSet, and 'tbm' is TIDBitmap. The last two results
are radix tree implementations: 'radix' is Andres's radix tree
implementation and 'radix_tree' is my radix tree implementation. In
both radix tree tests, I converted TIDs into an int64 and stored the
lower 6 bits in the value part of the radix tree.
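To make that encoding concrete, here is a hypothetical sketch of one way to
do it (the exact packing in the benchmark tool may differ): the TID is
packed into a 64-bit integer, the upper bits become the radix tree key, and
the low 6 bits select a bit in a 64-bit bitmap stored as the value.

#include "postgres.h"
#include "storage/itemptr.h"
#include "lib/radixtree.h"

static void
encode_and_insert_tid(radix_tree *tree, ItemPointer tid)
{
    uint64      tid_int;
    uint64      key;
    uint64      bitmap;
    bool        found;

    /* Pack block number and offset number into one 64-bit integer;
     * the 16-bit shift assumes offset numbers fit in 16 bits. */
    tid_int = ((uint64) ItemPointerGetBlockNumber(tid) << 16) |
        ItemPointerGetOffsetNumber(tid);

    /* Upper bits form the key; the low 6 bits pick a bit in the value */
    key = tid_int >> 6;

    /* radix_tree_search returns (Datum) 0 when the key is not present */
    bitmap = DatumGetUInt64(radix_tree_search(tree, key, &found));
    bitmap |= UINT64_C(1) << (tid_int & 0x3F);

    radix_tree_insert(tree, key, UInt64GetDatum(bitmap), &found);
}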
Overall, radix tree implementations have good numbers. Once we get
agreement on moving in this direction, I'll start a new thread for
that and move the implementation further; there are many things to do
and discuss: deletion, API design, SIMD support, more tests, etc.
Regards,
[1]: https://github.com/MasahikoSawada/pgtools/tree/master/bdbench
[2]: /messages/by-id/CAFiTN-visUO9VTz2+h224z5QeUjKhKNdSfjaCucPhYJdbzxx0g@mail.gmail.com
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachments:
radixtree.patch (application/octet-stream)
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..fd002d594a 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,9 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
+radixtree.o: CFLAGS+=-mavx2
+
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..a5ad897ee9
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,1377 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module is based on the paper "The Adaptive Radix Tree: ARTful Indexing
+ * for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas Neumann,
+ * 2013.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instructions, enabling us to use 256-bit
+ * width SIMD vectors, whereas 128-bit width SIMD vectors are used in the paper.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+#if defined(__AVX2__)
+#include <immintrin.h> // x86 AVX2 intrinsics
+#endif
+
+/* How many bits are encoded in one tree level */
+#define RADIX_TREE_NODE_FANOUT 8
+
+#define RADIX_TREE_NODE_MAX_SLOTS (1 << RADIX_TREE_NODE_FANOUT)
+#define RADIX_TREE_NODE_MAX_SLOT_BITS \
+ (RADIX_TREE_NODE_MAX_SLOTS / (sizeof(uint8) * BITS_PER_BYTE))
+
+#define RADIX_TREE_CHUNK_MASK ((1 << RADIX_TREE_NODE_FANOUT) - 1)
+#define RADIX_TREE_MAX_SHIFT key_get_shift(UINT64_MAX)
+#define RADIX_TREE_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RADIX_TREE_NODE_FANOUT)
+
+#define GET_KEY_CHUNK(key, shift) \
+ ((uint8) (((key) >> (shift)) & RADIX_TREE_CHUNK_MASK))
+
+typedef enum radix_tree_node_kind
+{
+ RADIX_TREE_NODE_KIND_4 = 0,
+ RADIX_TREE_NODE_KIND_32,
+ RADIX_TREE_NODE_KIND_128,
+ RADIX_TREE_NODE_KIND_256
+} radix_tree_node_kind;
+#define RADIX_TREE_NODE_KIND_COUNT 4
+
+/*
+ * Base type for all nodes types.
+ *
+ * The key is a 64-bit unsigned integer and the value is a Datum. The internal
+ * tree nodes, shift > 0, store pointers to their child nodes as Datum values.
+ * The leaf nodes, shift == 0, store the value that the user specified as a Datum
+ * value.
+ */
+typedef struct radix_tree_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this node.
+ * That is, the key is shifted by 'shift' and the lowest RADIX_TREE_NODE_FANOUT
+ * bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size class of the node */
+ radix_tree_node_kind kind;
+} radix_tree_node;
+#define NodeIsLeaf(n) (((radix_tree_node *) (n))->shift == 0)
+#define NodeHasFreeSlot(n) \
+ (((radix_tree_node *) (n))->count < \
+ radix_tree_node_info[((radix_tree_node *) (n))->kind].max_slots)
+
+/*
+ * To reduce memory usage compared to a simple radix tree with a fixed fanout,
+ * we use adaptive node sizes, with different storage methods for different
+ * numbers of elements.
+ */
+typedef struct radix_tree_node_4
+{
+ radix_tree_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+ Datum slots[4];
+} radix_tree_node_4;
+
+typedef struct radix_tree_node_32
+{
+ radix_tree_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+ Datum slots[32];
+} radix_tree_node_32;
+
+typedef struct radix_tree_node_128
+{
+ radix_tree_node n;
+
+ /*
+ * The 1-based index into slots for each chunk. 0 means unused, while the
+ * slots array is 0-indexed, so the slot for chunk C is slots[slot_idxs[C] - 1].
+ */
+ uint8 slot_idxs[RADIX_TREE_NODE_MAX_SLOTS];
+
+ Datum slots[128];
+} radix_tree_node_128;
+
+typedef struct radix_tree_node_256
+{
+ radix_tree_node n;
+
+ /* A bitmap to track which slot is in use */
+ uint8 set[RADIX_TREE_NODE_MAX_SLOT_BITS];
+
+ Datum slots[RADIX_TREE_NODE_MAX_SLOTS];
+} radix_tree_node_256;
+#define RADIX_TREE_NODE_256_SET_BYTE(v) ((v) / RADIX_TREE_NODE_FANOUT)
+#define RADIX_TREE_NODE_256_SET_BIT(v) (UINT64_C(1) << ((v) % RADIX_TREE_NODE_FANOUT))
+
+/* Information of each size class */
+typedef struct radix_tree_node_info_elem
+{
+ const char *name;
+ int max_slots;
+ Size size;
+} radix_tree_node_info_elem;
+
+static radix_tree_node_info_elem radix_tree_node_info[] =
+{
+ {"radix tree node 4", 4, sizeof(radix_tree_node_4)},
+ {"radix tree node 32", 32, sizeof(radix_tree_node_32)},
+ {"radix tree node 128", 128, sizeof(radix_tree_node_128)},
+ {"radix tree node 256", 256, sizeof(radix_tree_node_256)},
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending order
+ * of the key. To support this, we iterate over the nodes of each level.
+ * radix_tree_iter_node_data struct is used to track the iteration within a node.
+ * radix_tree_iter has the array of this struct, stack, in order to track the iteration
+ * of every level. During the iteration, we also construct the key to return. The key
+ * is updated whenever we update the node iteration information, e.g., when advancing
+ * the current index within the node or when moving to the next node at the same level.
+ */
+typedef struct radix_tree_iter_node_data
+{
+ radix_tree_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} radix_tree_iter_node_data;
+
+struct radix_tree_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ radix_tree_iter_node_data stack[RADIX_TREE_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ radix_tree_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+ MemoryContextData *slabs[RADIX_TREE_NODE_KIND_COUNT];
+
+ /* stats */
+ uint64 mem_used;
+ int32 cnt[RADIX_TREE_NODE_KIND_COUNT];
+};
+
+static radix_tree_node *radix_tree_node_grow(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node);
+static radix_tree_node *radix_tree_find_child(radix_tree_node *node, uint64 key);
+static Datum *radix_tree_find_slot_ptr(radix_tree_node *node, uint8 chunk);
+static void radix_tree_replace_slot(radix_tree_node *parent, radix_tree_node *node,
+ uint8 chunk);
+static void radix_tree_extend(radix_tree *tree, uint64 key);
+static void radix_tree_new_root(radix_tree *tree, uint64 key, Datum val);
+static radix_tree_node *radix_tree_insert_child(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node,
+ uint64 key);
+static void radix_tree_insert_val(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node,
+ uint64 key, Datum val, bool *replaced_p);
+
+static inline void radix_tree_iter_update_key(radix_tree_iter *iter, uint8 chunk, uint8 shift);
+static Datum radix_tree_node_iterate_next(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ bool *found_p);
+static void radix_tree_store_iter_node(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ radix_tree_node *node);
+static void radix_tree_update_iter_stack(radix_tree_iter *iter, int from);
+
+static inline int
+node_32_search_eq(radix_tree_node_32 *node, uint8 chunk)
+{
+#ifdef __AVX2__
+ __m256i _key = _mm256_set1_epi8(chunk);
+ __m256i _data = _mm256_loadu_si256((__m256i_u *) node->chunks);
+ __m256i _cmp = _mm256_cmpeq_epi8(_key, _data);
+ uint32 bitfield = _mm256_movemask_epi8(_cmp);
+
+ bitfield &= ((UINT64_C(1) << node->n.count) - 1);
+
+ return (bitfield) ? __builtin_ctz(bitfield) : -1;
+
+#else
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] > chunk)
+ return -1;
+
+ if (node->chunks[i] == chunk)
+ return i;
+ }
+
+ return -1;
+#endif /* __AVX2__ */
+}
+
+/*
+ * This is a bit more complicated than search_chunk_array_16_eq(), because
+ * until recently no unsigned uint8 comparison instruction existed on x86. So
+ * we need to play some trickery using _mm_min_epu8() to effectively get
+ * <=. There never will be any equal elements in the current uses, but that's
+ * what we get here...
+ */
+static inline int
+node_32_search_le(radix_tree_node_32 *node, uint8 chunk)
+{
+#ifdef __AVX2__
+ __m256i _key = _mm256_set1_epi8(chunk);
+ __m256i _data = _mm256_loadu_si256((__m256i_u*) node->chunks);
+ __m256i _min = _mm256_min_epu8(_key, _data);
+ __m256i cmp = _mm256_cmpeq_epi8(_key, _min);
+ uint32_t bitfield=_mm256_movemask_epi8(cmp);
+
+ bitfield &= ((UINT64_C(1) << node->n.count) - 1);
+
+ return (bitfield) ? __builtin_ctz(bitfield) : node->n.count;
+#else
+ int index;
+
+ for (index = 0; index < node->n.count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+
+ return index;
+#endif /* __AVX2__ */
+}
+
+static inline int
+node_128_get_slot_pos(radix_tree_node_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] - 1;
+}
+
+static inline bool
+node_128_is_slot_used(radix_tree_node_128 *node, uint8 chunk)
+{
+ return (node_128_get_slot_pos(node, chunk) >= 0);
+}
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_256_is_slot_used(radix_tree_node_256 *node, uint8 chunk)
+{
+ return (node->set[RADIX_TREE_NODE_256_SET_BYTE(chunk)] &
+ RADIX_TREE_NODE_256_SET_BIT(chunk)) != 0;
+
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+node_256_set(radix_tree_node_256 *node, uint8 chunk, Datum slot)
+{
+ node->slots[chunk] = slot;
+ node->set[RADIX_TREE_NODE_256_SET_BYTE(chunk)] |= RADIX_TREE_NODE_256_SET_BIT(chunk);
+}
+
+/* Return the shift needed to store the given key */
+inline static int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RADIX_TREE_NODE_FANOUT) * RADIX_TREE_NODE_FANOUT;
+}
+
+/* Return the max value stored in a node with the given shift */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RADIX_TREE_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64_C(1) << (shift + RADIX_TREE_NODE_FANOUT)) - 1;
+}
+
+/* Allocate a new node with the given node kind */
+static radix_tree_node *
+radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
+{
+ radix_tree_node *newnode;
+
+ newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
+ radix_tree_node_info[kind].size);
+ newnode->kind = kind;
+
+ /* update stats */
+ tree->mem_used += GetMemoryChunkSpace(newnode);
+ tree->cnt[kind]++;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+radix_tree_free_node(radix_tree *tree, radix_tree_node *node)
+{
+ /* update stats */
+ tree->mem_used -= GetMemoryChunkSpace(node);
+ tree->cnt[node->kind]--;
+
+ pfree(node);
+}
+
+/* Copy the common fields without the node kind */
+static void
+radix_tree_copy_node_common(radix_tree_node *src, radix_tree_node *dst)
+{
+ dst->shift = src->shift;
+ dst->chunk = src->chunk;
+ dst->count = src->count;
+}
+
+/* The tree doesn't have sufficient height, so grow it */
+static void
+radix_tree_extend(radix_tree *tree, uint64 key)
+{
+ int max_shift;
+ int shift = tree->root->shift + RADIX_TREE_NODE_FANOUT;
+
+ max_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'max_shift' */
+ while (shift <= max_shift)
+ {
+ radix_tree_node_4 *node =
+ (radix_tree_node_4 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+
+ node->n.count = 1;
+ node->n.shift = shift;
+ node->chunks[0] = 0;
+ node->slots[0] = PointerGetDatum(tree->root);
+
+ tree->root->chunk = 0;
+ tree->root = (radix_tree_node *) node;
+
+ shift += RADIX_TREE_NODE_FANOUT;
+ }
+
+ tree->max_val = shift_get_max_val(max_shift);
+}
+
+/*
+ * Return the pointer to the child node corresponding to the key. Otherwise (if
+ * not found) return NULL.
+ */
+static radix_tree_node *
+radix_tree_find_child(radix_tree_node *node, uint64 key)
+{
+ Datum *slot_ptr;
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+
+ slot_ptr = radix_tree_find_slot_ptr(node, chunk);
+
+ return (slot_ptr == NULL) ? NULL : (radix_tree_node *) DatumGetPointer(*slot_ptr);
+}
+
+/*
+ * Return the address of the slot corresponding to chunk in the node, if found.
+ * Otherwise return NULL.
+ */
+static Datum *
+radix_tree_find_slot_ptr(radix_tree_node *node, uint8 chunk)
+{
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+
+ /* Do linear search */
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ if (n4->chunks[i] > chunk)
+ break;
+
+ if (n4->chunks[i] == chunk)
+ return &(n4->slots[i]);
+ }
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ int ret;
+
+ /* Search by SIMD instructions */
+ ret = node_32_search_eq(n32, chunk);
+
+ if (ret < 0)
+ break;
+
+ return &(n32->slots[ret]);
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ if (!node_128_is_slot_used(n128, chunk))
+ break;
+
+ return &(n128->slots[node_128_get_slot_pos(n128, chunk)]);
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ if (!node_256_is_slot_used(n256, chunk))
+ break;
+
+ return &(n256->slots[chunk]);
+ break;
+ }
+ }
+
+ return NULL;
+}
+
+/* Link from the parent to the node */
+static void
+radix_tree_replace_slot(radix_tree_node *parent, radix_tree_node *node, uint8 chunk)
+{
+ Datum *slot_ptr;
+
+ slot_ptr = radix_tree_find_slot_ptr(parent, chunk);
+ *slot_ptr = PointerGetDatum(node);
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+radix_tree_new_root(radix_tree *tree, uint64 key, Datum val)
+{
+ radix_tree_node_4 * n4 =
+ (radix_tree_node_4 * ) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+ int shift = key_get_shift(key);
+
+ n4->n.shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = (radix_tree_node *) n4;
+}
+
+/* Insert 'node' as a child node of 'parent' */
+static radix_tree_node *
+radix_tree_insert_child(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node,
+ uint64 key)
+{
+ radix_tree_node *newchild =
+ (radix_tree_node *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+
+ Assert(!NodeIsLeaf(node));
+
+ newchild->shift = node->shift - RADIX_TREE_NODE_FANOUT;
+ newchild->chunk = GET_KEY_CHUNK(key, node->shift);
+
+ radix_tree_insert_val(tree, parent, node, key, PointerGetDatum(newchild), NULL);
+
+ return (radix_tree_node *) newchild;
+}
+
+/*
+ * Insert the value to the node. The node grows if it's full.
+ *
+ * *replaced_p is set to true if the key already exists and its value is updated
+ * by this function.
+ */
+static void
+radix_tree_insert_val(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node,
+ uint64 key, Datum val, bool *replaced_p)
+{
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+ bool replaced = false;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+ int idx;
+
+ for (idx = 0; idx < n4->n.count; idx++)
+ {
+ if (n4->chunks[idx] >= chunk)
+ break;
+ }
+
+ if (NodeHasFreeSlot(n4))
+ {
+ if (n4->n.count == 0)
+ {
+ /* the first key for this node, add it */
+ }
+ else if (n4->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n4->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the array,
+ * make space for the new key.
+ */
+ memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
+ sizeof(uint8) * (n4->n.count - idx));
+ memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
+ sizeof(radix_tree_node *) * (n4->n.count - idx));
+ }
+
+ n4->chunks[idx] = chunk;
+ n4->slots[idx] = val;
+
+ /* Done */
+ break;
+ }
+
+ /* The node needs to grow */
+ node = radix_tree_node_grow(tree, parent, node);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_32);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ int idx;
+
+ idx = node_32_search_le(n32, chunk);
+
+ if (NodeHasFreeSlot(n32))
+ {
+ if (n32->n.count == 0)
+ {
+ /* first key for this node, add it */
+ }
+ else if (n32->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n32->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the array,
+ * make space for the new key.
+ */
+ memmove(&(n32->chunks[idx + 1]), &(n32->chunks[idx]),
+ sizeof(uint8) * (n32->n.count - idx));
+ memmove(&(n32->slots[idx + 1]), &(n32->slots[idx]),
+ sizeof(radix_tree_node *) * (n32->n.count - idx));
+ }
+
+ n32->chunks[idx] = chunk;
+ n32->slots[idx] = val;
+ break;
+ }
+
+ /* The node needs to grow */
+ node = radix_tree_node_grow(tree, parent, node);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_128);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ if (node_128_is_slot_used(n128, chunk))
+ {
+ n128->slots[node_128_get_slot_pos(n128, chunk)] = val;
+ replaced = true;
+ break;
+ }
+
+ if (NodeHasFreeSlot(n128))
+ {
+ uint8 pos = n128->n.count + 1;
+
+ n128->slot_idxs[chunk] = pos;
+ n128->slots[pos - 1] = val;
+ break;
+ }
+
+ node = radix_tree_node_grow(tree, parent, node);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_256);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ if (node_256_is_slot_used(n256, chunk))
+ replaced = true;
+
+ node_256_set(n256, chunk, val);
+ break;
+ }
+ }
+
+ if (!replaced)
+ node->count++;
+
+ if (replaced_p)
+ *replaced_p = replaced;
+}
+
+/* Change the node type to a larger one */
+static radix_tree_node *
+radix_tree_node_grow(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node)
+{
+ radix_tree_node *newnode = NULL;
+
+ Assert(node->count ==
+ radix_tree_node_info[node->kind].max_slots);
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+ radix_tree_node_32 *new32 =
+ (radix_tree_node_32 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_32);
+
+ radix_tree_copy_node_common((radix_tree_node *) n4,
+ (radix_tree_node *) new32);
+
+ memcpy(&(new32->chunks), &(n4->chunks), sizeof(uint8) * 4);
+ memcpy(&(new32->slots), &(n4->slots), sizeof(Datum) * 4);
+
+ newnode = (radix_tree_node *) new32;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ radix_tree_node_128 *new128 =
+ (radix_tree_node_128 *) radix_tree_alloc_node(tree,RADIX_TREE_NODE_KIND_128);
+
+ radix_tree_copy_node_common((radix_tree_node *) n32,
+ (radix_tree_node *) new128);
+
+ for (int i = 0; i < n32->n.count; i++)
+ {
+ new128->slot_idxs[n32->chunks[i]] = i + 1;
+ new128->slots[i] = n32->slots[i];
+ }
+
+ newnode = (radix_tree_node *) new128;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+ radix_tree_node_256 *new256 =
+ (radix_tree_node_256 *) radix_tree_alloc_node(tree,RADIX_TREE_NODE_KIND_256);
+ int cnt = 0;
+
+ radix_tree_copy_node_common((radix_tree_node *) n128,
+ (radix_tree_node *) new256);
+
+ for (int i = 0; i < RADIX_TREE_NODE_MAX_SLOTS && cnt < n128->n.count; i++)
+ {
+ if (!node_128_is_slot_used(n128, i))
+ continue;
+
+ node_256_set(new256, i, n128->slots[node_128_get_slot_pos(n128, i)]);
+ cnt++;
+ }
+
+ newnode = (radix_tree_node *) new256;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ elog(ERROR, "radix tree node_256 cannot grow");
+ break;
+ }
+
+ /* Replace the old node with the new one */
+ if (parent == node)
+ tree->root = newnode;
+ else
+ radix_tree_replace_slot(parent, newnode, node->chunk);
+
+ /* Free the old node */
+ radix_tree_free_node(tree, node);
+
+ return newnode;
+}
+
+/* Create the radix tree in the given memory context */
+radix_tree *
+radix_tree_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->max_val = 0;
+ tree->root = NULL;
+ tree->context = ctx;
+ tree->num_keys = 0;
+ tree->mem_used = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RADIX_TREE_NODE_KIND_COUNT; i++)
+ {
+ tree->slabs[i] = SlabContextCreate(ctx,
+ radix_tree_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ radix_tree_node_info[i].size);
+ tree->cnt[i] = 0;
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+void
+radix_tree_destroy(radix_tree *tree)
+{
+ for (int i = 0; i < RADIX_TREE_NODE_KIND_COUNT; i++)
+ MemoryContextDelete(tree->slabs[i]);
+
+ pfree(tree);
+}
+
+/*
+ * Insert the key with the val.
+ *
+ * found_p, if not NULL, is set to true if the key is already present,
+ * otherwise false.
+ *
+ * XXX: consider a better API. Is it better to support an 'update' flag
+ * instead of 'found_p' so the user can ask to update the value if the key
+ * already exists?
+ */
+void
+radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
+{
+ int shift;
+ bool replaced;
+ radix_tree_node *node;
+ radix_tree_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ radix_tree_new_root(tree, key, val);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ radix_tree_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+ while (shift > 0)
+ {
+ radix_tree_node *child;
+
+ child = radix_tree_find_child(node, key);
+
+ if (child == NULL)
+ child = radix_tree_insert_child(tree, parent, node, key);
+
+ parent = node;
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+ /* arrived at a leaf, so insert the value */
+ Assert(NodeIsLeaf(node));
+ radix_tree_insert_val(tree, parent, node, key, val, &replaced);
+
+ if (!replaced)
+ tree->num_keys++;
+
+ if (found_p)
+ *found_p = replaced;
+}
+
+/*
+ * Return the Datum value of the given key.
+ *
+ * found_p is set to true if it's found, otherwise false.
+ */
+Datum
+radix_tree_search(radix_tree *tree, uint64 key, bool *found_p)
+{
+ radix_tree_node *node;
+ int shift;
+
+ if (!tree->root || key > tree->max_val)
+ goto not_found;
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ radix_tree_node *child;
+
+ if (NodeIsLeaf(node))
+ {
+ Datum *slot_ptr;
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+
+ /* We reached a leaf node, find the corresponding slot */
+ slot_ptr = radix_tree_find_slot_ptr(node, chunk);
+
+ if (slot_ptr == NULL)
+ goto not_found;
+
+ /* Found! */
+ *found_p = true;
+ return *slot_ptr;
+ }
+
+ child = radix_tree_find_child(node, key);
+
+ if (child == NULL)
+ goto not_found;
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+not_found:
+ *found_p = false;
+ return (Datum) 0;
+}
+
+/* Create and return the iterator for the given radix tree */
+radix_tree_iter *
+radix_tree_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ radix_tree_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (radix_tree_iter *) palloc0(sizeof(radix_tree_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree)
+ return iter;
+
+ top_level = iter->tree->root->shift / RADIX_TREE_NODE_FANOUT;
+
+ iter->stack_len = top_level;
+ iter->stack[top_level].node = iter->tree->root;
+ iter->stack[top_level].current_idx = -1;
+
+ /* Descend to the leftmost leaf node from the root */
+ radix_tree_update_iter_stack(iter, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true and set key_p and value_p if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+radix_tree_iterate_next(radix_tree_iter *iter, uint64 *key_p, Datum *value_p)
+{
+ bool found = false;
+ Datum slot = (Datum) 0;
+ int level;
+
+ /* Empty tree */
+ if (!iter->tree)
+ return false;
+
+ for (;;)
+ {
+ radix_tree_node *node;
+ radix_tree_iter_node_data *node_iter;
+
+ /*
+ * Iterate node at each level from the bottom of the tree until we find
+ * the next slot.
+ */
+ for (level = 0; level <= iter->stack_len; level++)
+ {
+ slot = radix_tree_node_iterate_next(iter, &(iter->stack[level]), &found);
+
+ if (found)
+ break;
+ }
+
+ /* end of iteration */
+ if (!found)
+ return false;
+
+ /* found the next slot at the leaf node, return it */
+ if (level == 0)
+ {
+ *key_p = iter->key;
+ *value_p = slot;
+ return true;
+ }
+
+ /*
+ * We have advanced more than one node, including internal nodes, so we need
+ * to update the stack by descending to the leftmost leaf node from this level.
+ */
+ node = (radix_tree_node *) DatumGetPointer(slot);
+ node_iter = &(iter->stack[level - 1]);
+ radix_tree_store_iter_node(iter, node_iter, node);
+
+ radix_tree_update_iter_stack(iter, level - 1);
+ }
+}
+
+void
+radix_tree_end_iterate(radix_tree_iter *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Update the part of the key being constructed during the iteration with the
+ * given chunk
+ */
+static inline void
+radix_tree_iter_update_key(radix_tree_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RADIX_TREE_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Iterate over the given radix tree node and return its next slot, if any,
+ * setting *found_p to true. Otherwise, set *found_p to false.
+ */
+static Datum
+radix_tree_node_iterate_next(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ bool *found_p)
+{
+ radix_tree_node *node = node_iter->node;
+ Datum slot = (Datum) 0;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n4->n.count)
+ goto not_found;
+
+ slot = n4->slots[node_iter->current_idx];
+
+ /* Update the part of the key with the current chunk */
+ if (NodeIsLeaf(node))
+ radix_tree_iter_update_key(iter, n4->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n32->n.count)
+ goto not_found;
+
+ slot = n32->slots[node_iter->current_idx];
+
+ /* Update the part of the key with the current chunk */
+ if (NodeIsLeaf(node))
+ radix_tree_iter_update_key(iter, n32->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RADIX_TREE_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_slot_used(n128, i))
+ break;
+ }
+
+ if (i >= RADIX_TREE_NODE_MAX_SLOTS)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = n128->slots[node_128_get_slot_pos(n128, i)];
+
+ /* Update the part of the key */
+ if (NodeIsLeaf(node))
+ radix_tree_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RADIX_TREE_NODE_MAX_SLOTS; i++)
+ {
+ if (node_256_is_slot_used(n256, i))
+ break;
+ }
+
+ if (i >= RADIX_TREE_NODE_MAX_SLOTS)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = n256->slots[i];
+
+ /* Update the part of the key */
+ if (NodeIsLeaf(node))
+ radix_tree_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ }
+
+ *found_p = true;
+ return slot;
+
+not_found:
+ *found_p = false;
+ return (Datum) 0;
+}
+
+/*
+ * Initialize and update the node iteration struct with the given radix tree node.
+ * This function also updates the part of the key with the chunk of the given node.
+ */
+static void
+radix_tree_store_iter_node(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ radix_tree_node *node)
+{
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ radix_tree_iter_update_key(iter, node->chunk, node->shift + RADIX_TREE_NODE_FANOUT);
+}
+
+/*
+ * Build the stack of the radix tree node while descending to the leaf from the 'from'
+ * level.
+ */
+static void
+radix_tree_update_iter_stack(radix_tree_iter *iter, int from)
+{
+ radix_tree_node *node = iter->stack[from].node;
+ int level = from;
+
+ for (;;)
+ {
+ radix_tree_iter_node_data *node_iter = &(iter->stack[level--]);
+ bool found;
+
+ /* Set the current node */
+ radix_tree_store_iter_node(iter, node_iter, node);
+
+ if (NodeIsLeaf(node))
+ break;
+
+ node = (radix_tree_node *)
+ DatumGetPointer(radix_tree_node_iterate_next(iter, node_iter, &found));
+
+ /*
+ * Since we always take the first slot in the node, we must have found
+ * a slot here.
+ */
+ Assert(found);
+ }
+}
+
+uint64
+radix_tree_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+uint64
+radix_tree_memory_usage(radix_tree *tree)
+{
+ return tree->mem_used;
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RADIX_TREE_DEBUG
+void
+radix_tree_stats(radix_tree *tree)
+{
+ fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u(%lu), n32 = %u(%lu), n128 = %u(%lu), n256 = %u(%lu)",
+ tree->num_keys,
+ tree->root->shift / RADIX_TREE_NODE_FANOUT,
+ tree->cnt[0], tree->cnt[0] * sizeof(radix_tree_node_4),
+ tree->cnt[1], tree->cnt[1] * sizeof(radix_tree_node_32),
+ tree->cnt[2], tree->cnt[2] * sizeof(radix_tree_node_128),
+ tree->cnt[3], tree->cnt[3] * sizeof(radix_tree_node_256));
+ //radix_tree_dump(tree);
+}
+
+static void
+radix_tree_print_slot(StringInfo buf, uint8 chunk, Datum slot, int idx, bool is_leaf, int level)
+{
+ char space[128] = {0};
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ if (is_leaf)
+ appendStringInfo(buf, "%s[%d] \"0x%X\" val(0x%lX) LEAF\n",
+ space,
+ idx,
+ chunk,
+ DatumGetInt64(slot));
+ else
+ appendStringInfo(buf , "%s[%d] \"0x%X\" -> ",
+ space,
+ idx,
+ chunk);
+}
+
+static void
+radix_tree_dump_node(radix_tree_node *node, int level, StringInfo buf, bool recurse)
+{
+ bool is_leaf = NodeIsLeaf(node);
+
+ appendStringInfo(buf, "[\"%s\" type %d, cnt %u, shift %u, chunk \"0x%X\"] chunks:\n",
+ NodeIsLeaf(node) ? "LEAF" : "INNR",
+ (node->kind == RADIX_TREE_NODE_KIND_4) ? 4 :
+ (node->kind == RADIX_TREE_NODE_KIND_32) ? 32 :
+ (node->kind == RADIX_TREE_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ radix_tree_print_slot(buf, n4->chunks[i], n4->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n4->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+
+ for (int i = 0; i < n32->n.count; i++)
+ {
+ radix_tree_print_slot(buf, n32->chunks[i], n32->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n32->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ for (int i = 0; i < RADIX_TREE_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_slot_used(n128, i))
+ continue;
+
+ radix_tree_print_slot(buf, i, n128->slots[node_128_get_slot_pos(n128, i)],
+ i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n128->slots[node_128_get_slot_pos(n128, i)],
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ for (int i = 0; i < RADIX_TREE_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_256_is_slot_used(n256, i))
+ continue;
+
+ radix_tree_print_slot(buf, i, n256->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n256->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+radix_tree_dump_search(radix_tree *tree, uint64 key)
+{
+ StringInfoData buf;
+ radix_tree_node *node;
+ int shift;
+ int level = 0;
+
+ elog(WARNING, "-----------------------------------------------------------");
+ elog(WARNING, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(WARNING, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(WARNING, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ radix_tree_node *child;
+
+ radix_tree_dump_node(node, level, &buf, false);
+
+ if (NodeIsLeaf(node))
+ {
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+
+ /* We reached a leaf node, find the corresponding slot */
+ radix_tree_find_slot_ptr(node, chunk);
+
+ break;
+ }
+
+ child = radix_tree_find_child(node, key);
+
+ if (child == NULL)
+ break;
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ level++;
+ }
+
+ elog(WARNING, "\n%s", buf.data);
+}
+
+void
+radix_tree_dump(radix_tree *tree)
+{
+ StringInfoData buf;
+
+ initStringInfo(&buf);
+
+ elog(WARNING, "-----------------------------------------------------------");
+ elog(WARNING, "max_val = %lu", tree->max_val);
+ radix_tree_dump_node(tree->root, 0, &buf, true);
+ elog(WARNING, "\n%s", buf.data);
+ elog(WARNING, "-----------------------------------------------------------");
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..fe5a4fd79a
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,41 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RADIX_TREE_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct radix_tree_iter radix_tree_iter;
+
+extern radix_tree *radix_tree_create(MemoryContext ctx);
+extern Datum radix_tree_search(radix_tree *tree, uint64 key, bool *found);
+extern void radix_tree_destroy(radix_tree *tree);
+extern void radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p);
+extern uint64 radix_tree_memory_usage(radix_tree *tree);
+extern uint64 radix_tree_num_entries(radix_tree *tree);
+
+extern radix_tree_iter *radix_tree_begin_iterate(radix_tree *tree);
+extern bool radix_tree_iterate_next(radix_tree_iter *iter, uint64 *key_p, Datum *value_p);
+extern void radix_tree_end_iterate(radix_tree_iter *iter);
+
+
+#ifdef RADIX_TREE_DEBUG
+extern void radix_tree_dump(radix_tree *tree);
+extern void radix_tree_dump_search(radix_tree *tree, uint64 key);
+extern void radix_tree_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9090226daa..51b2514faf 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -24,6 +24,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'intset_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..0c96ebc739
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,20 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..e9fe7e0124
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,397 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool intset_test_stats = true;
+
+static int radix_tree_node_max_entries[] = {4, 32, 128, 256};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void test_empty(void);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ bool found;
+
+ radixtree = radix_tree_create(CurrentMemoryContext);
+
+ radix_tree_search(radixtree, 0, &found);
+ if (found)
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ radix_tree_search(radixtree, 1, &found);
+ if (found)
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ radix_tree_search(radixtree, PG_UINT64_MAX, &found);
+ if (found)
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ if (radix_tree_num_entries(radixtree) != 0)
+ elog(ERROR, "radix_tree_num_entries on empty tree return non-zero");
+
+ radix_tree_destroy(radixtree);
+}
+
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+ Datum val;
+
+ val = radix_tree_search(radixtree, key, &found);
+ if (!found)
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (DatumGetUInt64(val) != key)
+ elog(ERROR, "radix_tree_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, DatumGetUInt64(val), key);
+ }
+}
+
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+ uint64 num_entries;
+
+ radixtree = radix_tree_create(CurrentMemoryContext);
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ radix_tree_insert(radixtree, key, Int64GetDatum(key), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(radix_tree_node_max_entries); j++)
+ {
+ if (i == (radix_tree_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : radix_tree_node_max_entries[j - 1],
+ radix_tree_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = radix_tree_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "radix_tree_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+ radix_tree *radixtree;
+ radix_tree_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (intset_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily. (radix_tree_create() creates a memory context of its
+ * own, too, but we don't have direct access to it, so we cannot call
+ * MemoryContextStats() on it directly).
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = radix_tree_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ radix_tree_insert(radixtree, x, Int64GetDatum(x), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (intset_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by radix_tree_memory_usage(), as well as the
+ * stats from the memory context. They should be in the same ballpark,
+ * but it's hard to automate testing that, so if you're making changes to
+ * the implementation, just observe that manually.
+ */
+ if (intset_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by radix_tree_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = radix_tree_memory_usage(radixtree);
+ fprintf(stderr, "radix_tree_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that radix_tree_num_entries works */
+ n = radix_tree_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "radix_tree_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with radix_tree_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ Datum v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to radix_tree_search()? */
+ v = radix_tree_search(radixtree, x, &found);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (DatumGetUInt64(v) != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ DatumGetUInt64(v), x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (intset_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = radix_tree_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!radix_tree_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ if (DatumGetUInt64(val) != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (intset_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Overall, radix tree implementations have good numbers. Once we get
agreement on moving in this direction, I'll start a new thread for
that and move the implementation further; there are many things to do
and discuss: deletion, API design, SIMD support, more tests, etc.
+1
(FWIW, I think the current thread is still fine.)
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, May 10, 2022 at 6:58 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Overall, radix tree implementations have good numbers. Once we get
agreement on moving in this direction, I'll start a new thread for
that and move the implementation further; there are many things to do
and discuss: deletion, API design, SIMD support, more tests, etc.
+1
Thanks!
I've attached an updated version of the patch. It is still WIP, but I've
implemented deletion and improved the test cases and comments.
(FWIW, I think the current thread is still fine.)
Okay, agreed.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachments:
radixtree_wip_v2.patch (application/octet-stream)
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..fd002d594a 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,9 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
+radixtree.o: CFLAGS+=-mavx2
+
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..ad08f45fd8
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,1632 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instructions, enabling us to use 256-bit
+ * wide SIMD vectors, whereas 128-bit wide SIMD vectors are used in the paper.
+ * Also, there is no support for path compression or lazy path expansion. The
+ * radix tree supports a fixed key length, so we don't expect the tree to be
+ * very high.
+ *
+ * The key is a 64-bit unsigned integer and the value is a Datum. Internal
+ * nodes and leaf nodes have the identical structure. Internal nodes (shift > 0)
+ * store pointers to their child nodes as values, while leaf nodes (shift == 0)
+ * store the Datum values specified by the user.
+ *
+ * Interface
+ * ---------
+ *
+ * radix_tree_create - Create a new, empty radix tree
+ * radix_tree_destroy - Destroy the radix tree
+ * radix_tree_insert - Insert a key-value pair
+ * radix_tree_delete - Delete a key-value pair
+ * radix_tree_begin_iterate - Begin iterating through all key-value pairs
+ * radix_tree_iterate_next - Return next key-value pair, if any
+ * radix_tree_end_iterate - End iteration
+ *
+ * radix_tree_create() creates an empty radix tree in the given memory context,
+ * along with child memory contexts for each kind of radix tree node.
+ *
+ * radix_tree_iterate_next() returns key-value pairs in ascending key order.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+#if defined(__AVX2__)
+#include <immintrin.h>			/* x86 AVX2 intrinsics */
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RADIX_TREE_NODE_FANOUT 8
+
+/* The maximum number of slots in a node, used in node-256 */
+#define RADIX_TREE_NODE_MAX_SLOTS (1 << RADIX_TREE_NODE_FANOUT)
+
+/*
+ * Return the number of bytes needed for an is-set bitmap covering nslots
+ * slots, used in node-128 and node-256.
+ */
+#define RADIX_TREE_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RADIX_TREE_CHUNK_MASK ((1 << RADIX_TREE_NODE_FANOUT) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RADIX_TREE_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Maximum number of levels in the radix tree */
+#define RADIX_TREE_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RADIX_TREE_NODE_FANOUT)
+
+/* Get a chunk from the key */
+#define GET_KEY_CHUNK(key, shift) \
+ ((uint8) (((key) >> (shift)) & RADIX_TREE_CHUNK_MASK))
+
+/* Mapping from value to the bit in is-set bitmap in the node */
+#define NODE_BITMAP_BYTE(v) ((v) / RADIX_TREE_NODE_FANOUT)
+#define NODE_BITMAP_BIT(v) (UINT64_C(1) << ((v) % RADIX_TREE_NODE_FANOUT))
+
+/* Enum used by radix_tree_node_search */
+typedef enum radix_tree_action
+{
+ RADIX_TREE_FIND = 0, /* find the key-value */
+ RADIX_TREE_DELETE, /* delete the key-value */
+} radix_tree_action;
+
+/*
+ * Supported radix tree node kinds.
+ *
+ * XXX: should we add KIND_16 as we can utilize SSE2 SIMD instructions?
+ */
+typedef enum radix_tree_node_kind
+{
+ RADIX_TREE_NODE_KIND_4 = 0,
+ RADIX_TREE_NODE_KIND_32,
+ RADIX_TREE_NODE_KIND_128,
+ RADIX_TREE_NODE_KIND_256
+} radix_tree_node_kind;
+#define RADIX_TREE_NODE_KIND_COUNT 4
+
+/*
+ * Base type for all nodes types.
+ */
+typedef struct radix_tree_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at a fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this node.
+ * That is, the key is shifted by 'shift' and the lowest RADIX_TREE_NODE_FANOUT
+ * bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size class of the node */
+ radix_tree_node_kind kind;
+} radix_tree_node;
+/* Macros for radix tree nodes */
+#define IS_LEAF_NODE(n) (((radix_tree_node *) (n))->shift == 0)
+#define IS_EMPTY_NODE(n) (((radix_tree_node *) (n))->count == 0)
+#define HAS_FREE_SLOT(n) \
+ (((radix_tree_node *) (n))->count < \
+ radix_tree_node_info[((radix_tree_node *) (n))->kind].max_slots)
+
+/*
+ * To reduce memory usage compared to a simple radix tree with a fixed fanout,
+ * we use adaptive node sizes, with different storage methods for different
+ * numbers of elements.
+ */
+typedef struct radix_tree_node_4
+{
+ radix_tree_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+ Datum slots[4];
+} radix_tree_node_4;
+
+typedef struct radix_tree_node_32
+{
+ radix_tree_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+ Datum slots[32];
+} radix_tree_node_32;
+
+#define RADIX_TREE_NODE_128_BITS RADIX_TREE_NODE_NSLOTS_BITS(128)
+typedef struct radix_tree_node_128
+{
+ radix_tree_node n;
+
+ /*
+ * The 1-based index into 'slots' for each chunk; 0 means the chunk is
+ * unused. So the slot for chunk C is slots[slot_idxs[C] - 1].
+ */
+ uint8 slot_idxs[RADIX_TREE_NODE_MAX_SLOTS];
+
+ /* A bitmap to track which slot is in use */
+ uint8 isset[RADIX_TREE_NODE_128_BITS];
+ Datum slots[128];
+} radix_tree_node_128;
+
+#define RADIX_TREE_NODE_MAX_BITS RADIX_TREE_NODE_NSLOTS_BITS(RADIX_TREE_NODE_MAX_SLOTS)
+typedef struct radix_tree_node_256
+{
+ radix_tree_node n;
+
+ /* A bitmap to track which slot is in use */
+ uint8 isset[RADIX_TREE_NODE_MAX_BITS];
+
+ Datum slots[RADIX_TREE_NODE_MAX_SLOTS];
+} radix_tree_node_256;
+
+/* Information of each size class */
+typedef struct radix_tree_node_info_elem
+{
+ const char *name;
+ int max_slots;
+ Size size;
+} radix_tree_node_info_elem;
+
+static radix_tree_node_info_elem radix_tree_node_info[] =
+{
+ {"radix tree node 4", 4, sizeof(radix_tree_node_4)},
+ {"radix tree node 32", 32, sizeof(radix_tree_node_32)},
+ {"radix tree node 128", 128, sizeof(radix_tree_node_128)},
+ {"radix tree node 256", 256, sizeof(radix_tree_node_256)},
+};
+
+/*
+ * As we descend the radix tree, we push the visited nodes onto a stack. The
+ * stack is used during deletion.
+ */
+typedef struct radix_tree_stack_data
+{
+ radix_tree_node *node;
+ struct radix_tree_stack_data *parent;
+} radix_tree_stack_data;
+typedef radix_tree_stack_data *radix_tree_stack;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each key-value pair in ascending key order.
+ * To support this, we iterate over the nodes at each level. The
+ * radix_tree_iter_node_data struct is used to track the iteration within a node.
+ * radix_tree_iter has an array of these structs, 'stack', in order to track the
+ * iteration at every level. During the iteration, we also construct the key to return. The key
+ * is updated whenever we update the node iteration information, e.g., when advancing
+ * the current index within the node or when moving to the next node at the same level.
+ */
+typedef struct radix_tree_iter_node_data
+{
+ radix_tree_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} radix_tree_iter_node_data;
+
+struct radix_tree_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ radix_tree_iter_node_data stack[RADIX_TREE_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* The radix tree structure itself */
+struct radix_tree
+{
+ MemoryContext context;
+
+ radix_tree_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+ MemoryContextData *slabs[RADIX_TREE_NODE_KIND_COUNT];
+
+ /* stats */
+ uint64 mem_used;
+ int32 cnt[RADIX_TREE_NODE_KIND_COUNT];
+};
+
+static radix_tree_node *radix_tree_node_grow(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key);
+static bool radix_tree_node_search_child(radix_tree_node *node, radix_tree_node **child_p,
+ uint64 key);
+static bool radix_tree_node_search(radix_tree_node *node, Datum **slot_p, uint64 key,
+ radix_tree_action action);
+static void radix_tree_extend(radix_tree *tree, uint64 key);
+static void radix_tree_new_root(radix_tree *tree, uint64 key, Datum val);
+static radix_tree_node *radix_tree_node_insert_child(radix_tree *tree,
+ radix_tree_node *parent,
+ radix_tree_node *node,
+ uint64 key);
+static void radix_tree_node_insert_val(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key, Datum val,
+ bool *replaced_p);
+static inline void radix_tree_iter_update_key(radix_tree_iter *iter, uint8 chunk, uint8 shift);
+static Datum radix_tree_node_iterate_next(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ bool *found_p);
+static void radix_tree_store_iter_node(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ radix_tree_node *node);
+static void radix_tree_update_iter_stack(radix_tree_iter *iter, int from);
+
+/*
+ * Helper functions for accessing each kind of nodes.
+ */
+static inline int
+node_32_search_eq(radix_tree_node_32 *node, uint8 chunk)
+{
+#ifdef __AVX2__
+ __m256i _key = _mm256_set1_epi8(chunk);
+ __m256i _data = _mm256_loadu_si256((__m256i_u *) node->chunks);
+ __m256i _cmp = _mm256_cmpeq_epi8(_key, _data);
+ uint32 bitfield = _mm256_movemask_epi8(_cmp);
+
+ bitfield &= ((UINT64_C(1) << node->n.count) - 1);
+
+ return (bitfield) ? __builtin_ctz(bitfield) : -1;
+
+#else
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] > chunk)
+ return -1;
+
+ if (node->chunks[i] == chunk)
+ return i;
+ }
+
+ return -1;
+#endif /* __AVX2__ */
+}
+
+/*
+ * This is a bit more complicated than node_32_search_eq(), because until
+ * recently no unsigned 8-bit comparison instruction existed on x86. So we
+ * need to play some trickery using _mm256_min_epu8() to effectively get <=.
+ */
+static inline int
+node_32_search_le(radix_tree_node_32 *node, uint8 chunk)
+{
+#ifdef __AVX2__
+ __m256i _key = _mm256_set1_epi8(chunk);
+ __m256i _data = _mm256_loadu_si256((__m256i_u*) node->chunks);
+ __m256i _min = _mm256_min_epu8(_key, _data);
+ __m256i cmp = _mm256_cmpeq_epi8(_key, _min);
+ uint32_t bitfield=_mm256_movemask_epi8(cmp);
+
+ bitfield &= ((UINT64_C(1) << node->n.count) - 1);
+
+ return (bitfield) ? __builtin_ctz(bitfield) : node->n.count;
+#else
+ int index;
+
+ for (index = 0; index < node->n.count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+
+ return index;
+#endif /* __AVX2__ */
+}
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(radix_tree_node_128 *node, uint8 chunk)
+{
+ return (node->slot_idxs[chunk] != 0);
+}
+
+/* Is the given slot in the node in use? */
+static inline bool
+node_128_is_slot_used(radix_tree_node_128 *node, uint8 slot)
+{
+ return ((node->isset[NODE_BITMAP_BYTE(slot)] & NODE_BITMAP_BIT(slot)) != 0);
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_128_set(radix_tree_node_128 *node, uint8 chunk, Datum slot)
+{
+ int slotpos = 0;
+
+ while (node_128_is_slot_used(node, slotpos))
+ slotpos++;
+ node->slot_idxs[chunk] = slotpos + 1;
+ node->slots[slotpos] = slot;
+ node->isset[NODE_BITMAP_BYTE(slotpos)] |= NODE_BITMAP_BIT(slotpos);
+}
+
+/* Delete the slot at the corresponding chunk */
+static inline void
+node_128_unset(radix_tree_node_128 *node, uint8 chunk)
+{
+ int slotpos = node->slot_idxs[chunk] - 1;
+
+ /* Clear the is-set bit of the slot, then mark the chunk unused */
+ node->isset[NODE_BITMAP_BYTE(slotpos)] &= ~(NODE_BITMAP_BIT(slotpos));
+ node->slot_idxs[chunk] = 0;
+}
+
+/* Return the slot data corresponding to the chunk */
+static inline Datum
+node_128_get_chunk_slot(radix_tree_node_128 *node, uint8 chunk)
+{
+ return node->slots[node->slot_idxs[chunk] - 1];
+}
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_256_is_chunk_used(radix_tree_node_256 *node, uint8 chunk)
+{
+ return (node->isset[NODE_BITMAP_BYTE(chunk)] & NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+node_256_set(radix_tree_node_256 *node, uint8 chunk, Datum slot)
+{
+ node->slots[chunk] = slot;
+ node->isset[NODE_BITMAP_BYTE(chunk)] |= NODE_BITMAP_BIT(chunk);
+}
+
+/* Unset the slot at the given chunk position */
+static inline void
+node_256_unset(radix_tree_node_256 *node, uint8 chunk)
+{
+ node->isset[NODE_BITMAP_BYTE(chunk)] &= ~(NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+inline static int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RADIX_TREE_NODE_FANOUT) * RADIX_TREE_NODE_FANOUT;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RADIX_TREE_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64_C(1) << (shift + RADIX_TREE_NODE_FANOUT)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static radix_tree_node *
+radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
+{
+ radix_tree_node *newnode;
+
+ newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
+ radix_tree_node_info[kind].size);
+ newnode->kind = kind;
+
+ /* stats */
+ tree->mem_used += GetMemoryChunkSpace(newnode);
+ tree->cnt[kind]++;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+radix_tree_free_node(radix_tree *tree, radix_tree_node *node)
+{
+ /* stats */
+ tree->mem_used -= GetMemoryChunkSpace(node);
+ tree->cnt[node->kind]--;
+
+ pfree(node);
+}
+
+/* Free a stack made by radix_tree_delete */
+static void
+radix_tree_free_stack(radix_tree_stack stack)
+{
+ radix_tree_stack ostack;
+
+ while (stack != NULL)
+ {
+ ostack = stack;
+ stack = stack->parent;
+ pfree(ostack);
+ }
+}
+
+/* Copy the common fields without the kind */
+static void
+radix_tree_copy_node_common(radix_tree_node *src, radix_tree_node *dst)
+{
+ dst->shift = src->shift;
+ dst->chunk = src->chunk;
+ dst->count = src->count;
+}
+
+/* The tree doesn't have sufficient height, so grow it */
+static void
+radix_tree_extend(radix_tree *tree, uint64 key)
+{
+ int max_shift;
+ int shift = tree->root->shift + RADIX_TREE_NODE_FANOUT;
+
+ max_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'max_shift' */
+ while (shift <= max_shift)
+ {
+ radix_tree_node_4 *node =
+ (radix_tree_node_4 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+
+ node->n.count = 1;
+ node->n.shift = shift;
+ node->chunks[0] = 0;
+ node->slots[0] = PointerGetDatum(tree->root);
+
+ tree->root->chunk = 0;
+ tree->root = (radix_tree_node *) node;
+
+ shift += RADIX_TREE_NODE_FANOUT;
+ }
+
+ tree->max_val = shift_get_max_val(max_shift);
+}
+
+/*
+ * Wrapper around radix_tree_node_search() to look up the pointer to a child
+ * node within the given node.
+ *
+ * Return true if the corresponding child is found, otherwise return false. On success,
+ * it sets child_p.
+ */
+static bool
+radix_tree_node_search_child(radix_tree_node *node, radix_tree_node **child_p, uint64 key)
+{
+ bool found = false;
+ Datum *slot_ptr;
+
+ if (radix_tree_node_search(node, &slot_ptr, key, RADIX_TREE_FIND))
+ {
+ /* Found the pointer to the child node */
+ found = true;
+ *child_p = (radix_tree_node *) DatumGetPointer(*slot_ptr);
+ }
+
+ return found;
+}
+
+/*
+ * Return true if the corresponding slot is used, otherwise return false. On success,
+ * sets the pointer to the slot to slot_p.
+ */
+static bool
+radix_tree_node_search(radix_tree_node *node, Datum **slot_p, uint64 key,
+ radix_tree_action action)
+{
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+
+ /* Do linear search */
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ if (n4->chunks[i] > chunk)
+ break;
+
+ if (n4->chunks[i] == chunk)
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n4->slots[i]);
+ else /* RADIX_TREE_DELETE */
+ {
+ memmove(&(n4->chunks[i]), &(n4->chunks[i + 1]),
+ sizeof(uint8) * (n4->n.count - i - 1));
+ memmove(&(n4->slots[i]), &(n4->slots[i + 1]),
+ sizeof(radix_tree_node *) * (n4->n.count - i - 1));
+ }
+
+ found = true;
+ break;
+ }
+ }
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ int idx;
+
+ /* Search by SIMD instructions */
+ idx = node_32_search_eq(n32, chunk);
+
+ if (idx >= 0)
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n32->slots[idx]);
+ else /* RADIX_TREE_DELETE */
+ {
+ memmove(&(n32->chunks[idx]), &(n32->chunks[idx + 1]),
+ sizeof(uint8) * (n32->n.count - idx - 1));
+ memmove(&(n32->slots[idx]), &(n32->slots[idx + 1]),
+ sizeof(radix_tree_node *) * (n32->n.count - idx - 1));
+ }
+
+ found = true;
+ }
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ if (node_128_is_chunk_used(n128, chunk))
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n128->slots[n128->slot_idxs[chunk] - 1]);
+ else /* RADIX_TREE_DELETE */
+ node_128_unset(n128, chunk);
+
+ found = true;
+ }
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ if (node_256_is_chunk_used(n256, chunk))
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n256->slots[chunk]);
+ else /* RADIX_TREE_DELETE */
+ node_256_unset(n256, chunk);
+
+ found = true;
+ }
+
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (action == RADIX_TREE_DELETE && found)
+ node->count--;
+
+ return found;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+radix_tree_new_root(radix_tree *tree, uint64 key, Datum val)
+{
+ radix_tree_node_4 * n4 =
+ (radix_tree_node_4 * ) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+ int shift = key_get_shift(key);
+
+ n4->n.shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = (radix_tree_node *) n4;
+}
+
+/* Insert 'node' as a child node of 'parent' */
+static radix_tree_node *
+radix_tree_node_insert_child(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key)
+{
+ radix_tree_node *newchild =
+ (radix_tree_node *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+
+ Assert(!IS_LEAF_NODE(node));
+
+ newchild->shift = node->shift - RADIX_TREE_NODE_FANOUT;
+ newchild->chunk = GET_KEY_CHUNK(key, node->shift);
+
+ radix_tree_node_insert_val(tree, parent, node, key, PointerGetDatum(newchild), NULL);
+
+ return (radix_tree_node *) newchild;
+}
+
+/*
+ * Insert the value to the node. The node grows if it's full.
+ */
+static void
+radix_tree_node_insert_val(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key, Datum val,
+ bool *replaced_p)
+{
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+ bool replaced = false;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+ int idx;
+
+ for (idx = 0; idx < n4->n.count; idx++)
+ {
+ if (n4->chunks[idx] >= chunk)
+ break;
+ }
+
+ if (HAS_FREE_SLOT(n4))
+ {
+ if (n4->n.count == 0)
+ {
+ /* the first key for this node, add it */
+ }
+ else if (n4->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n4->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the array,
+ * make space for the new key.
+ */
+ memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
+ sizeof(uint8) * (n4->n.count - idx));
+ memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
+ sizeof(radix_tree_node *) * (n4->n.count - idx));
+ }
+
+ n4->chunks[idx] = chunk;
+ n4->slots[idx] = val;
+
+ /* Done */
+ break;
+ }
+
+ /* The node needs to grow */
+ node = radix_tree_node_grow(tree, parent, node, key);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_32);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ int idx;
+
+ idx = node_32_search_le(n32, chunk);
+
+ if (HAS_FREE_SLOT(n32))
+ {
+ if (n32->n.count == 0)
+ {
+ /* first key for this node, add it */
+ }
+ else if (n32->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n32->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the array,
+ * make space for the new key.
+ */
+ memmove(&(n32->chunks[idx + 1]), &(n32->chunks[idx]),
+ sizeof(uint8) * (n32->n.count - idx));
+ memmove(&(n32->slots[idx + 1]), &(n32->slots[idx]),
+ sizeof(radix_tree_node *) * (n32->n.count - idx));
+ }
+
+ n32->chunks[idx] = chunk;
+ n32->slots[idx] = val;
+ break;
+ }
+
+ /* The node needs to grow */
+ node = radix_tree_node_grow(tree, parent, node, key);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_128);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ if (node_128_is_chunk_used(n128, chunk))
+ {
+ /* found the existing value */
+ node_128_set(n128, chunk, val);
+ replaced = true;
+ break;
+ }
+
+ if (HAS_FREE_SLOT(n128))
+ {
+ node_128_set(n128, chunk, val);
+ break;
+ }
+
+ node = radix_tree_node_grow(tree, parent, node, key);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_256);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ if (node_256_is_chunk_used(n256, chunk))
+ replaced = true;
+
+ node_256_set(n256, chunk, val);
+ break;
+ }
+ }
+
+ if (!replaced)
+ node->count++;
+
+ if (replaced_p)
+ *replaced_p = replaced;
+}
+
+/* Change the node type to a larger one */
+static radix_tree_node *
+radix_tree_node_grow(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node,
+ uint64 key)
+{
+ radix_tree_node *newnode = NULL;
+
+ Assert(node->count ==
+ radix_tree_node_info[node->kind].max_slots);
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+ radix_tree_node_32 *new32 =
+ (radix_tree_node_32 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_32);
+
+ radix_tree_copy_node_common((radix_tree_node *) n4,
+ (radix_tree_node *) new32);
+
+ memcpy(&(new32->chunks), &(n4->chunks), sizeof(uint8) * 4);
+ memcpy(&(new32->slots), &(n4->slots), sizeof(Datum) * 4);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ /* Check if the chunks in the new node are sorted */
+ for (int i = 1; i < new32->n.count ; i++)
+ Assert(new32->chunks[i - 1] <= new32->chunks[i]);
+ Assert(new32->n.count == 4);
+ }
+#endif
+
+ newnode = (radix_tree_node *) new32;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ radix_tree_node_128 *new128 =
+ (radix_tree_node_128 *) radix_tree_alloc_node(tree,RADIX_TREE_NODE_KIND_128);
+
+ radix_tree_copy_node_common((radix_tree_node *) n32,
+ (radix_tree_node *) new128);
+
+ for (int i = 0; i < n32->n.count; i++)
+ node_128_set(new128, n32->chunks[i], n32->slots[i]);
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ for (int i = 0; i < n32->n.count; i++)
+ Assert(node_128_is_chunk_used(new128, n32->chunks[i]));
+ Assert(new128->n.count == 32);
+ }
+#endif
+
+ newnode = (radix_tree_node *) new128;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+ radix_tree_node_256 *new256 =
+ (radix_tree_node_256 *) radix_tree_alloc_node(tree,RADIX_TREE_NODE_KIND_256);
+ int cnt = 0;
+
+ radix_tree_copy_node_common((radix_tree_node *) n128,
+ (radix_tree_node *) new256);
+
+ for (int i = 0; i < 256 && cnt < n128->n.count; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ node_256_set(new256, i, node_128_get_chunk_slot(n128, i));
+ cnt++;
+ }
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ int n = 0;
+ for (int i = 0; i < RADIX_TREE_NODE_MAX_BITS; i++)
+ n += pg_popcount32(new256->isset[i]);
+
+ Assert(new256->n.count == n);
+ }
+#endif
+
+ newnode = (radix_tree_node *) new256;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ elog(ERROR, "radix tree node_256 cannot be grew");
+ break;
+ }
+
+ if (parent == node)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = newnode;
+ }
+ else
+ {
+ Datum *slot_ptr = NULL;
+
+ /* Redirect from the parent to the node */
+ radix_tree_node_search(parent, &slot_ptr, key, RADIX_TREE_FIND);
+ Assert(*slot_ptr);
+ *slot_ptr = PointerGetDatum(newnode);
+ }
+
+ radix_tree_free_node(tree, node);
+
+ return newnode;
+}
+
+radix_tree *
+radix_tree_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->max_val = 0;
+ tree->root = NULL;
+ tree->context = ctx;
+ tree->num_keys = 0;
+ tree->mem_used = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RADIX_TREE_NODE_KIND_COUNT; i++)
+ {
+ tree->slabs[i] = SlabContextCreate(ctx,
+ radix_tree_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ radix_tree_node_info[i].size);
+ tree->cnt[i] = 0;
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+void
+radix_tree_destroy(radix_tree *tree)
+{
+ for (int i = 0; i < RADIX_TREE_NODE_KIND_COUNT; i++)
+ MemoryContextDelete(tree->slabs[i]);
+
+ pfree(tree);
+}
+
+/*
+ * Insert the key with the val.
+ *
+ * If found_p is not NULL, it is set to true if the key was already present,
+ * otherwise to false.
+ *
+ * XXX: do we need to support update_if_exists behavior?
+ */
+void
+radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
+{
+ int shift;
+ bool replaced;
+ radix_tree_node *node;
+ radix_tree_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ radix_tree_new_root(tree, key, val);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ radix_tree_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+ while (shift > 0)
+ {
+ radix_tree_node *child;
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ child = radix_tree_node_insert_child(tree, parent, node, key);
+
+ Assert(child != NULL);
+
+ parent = node;
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+ /* arrived at a leaf */
+ Assert(IS_LEAF_NODE(node));
+
+ radix_tree_node_insert_val(tree, parent, node, key, val, &replaced);
+
+ /* Update the statistics */
+ if (!replaced)
+ tree->num_keys++;
+
+ if (found_p)
+ *found_p = replaced;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is stored in *val_p, so val_p
+ * must not be NULL.
+ */
+bool
+radix_tree_search(radix_tree *tree, uint64 key, Datum *val_p)
+{
+ radix_tree_node *node;
+ Datum *value_ptr;
+ int shift;
+
+ Assert(val_p);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift > 0)
+ {
+ radix_tree_node *child;
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ return false;
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+ /* We reached a leaf node; search the corresponding slot */
+ Assert(IS_LEAF_NODE(node));
+
+ if (!radix_tree_node_search(node, &value_ptr, key, RADIX_TREE_FIND))
+ return false;
+
+ /* Found, set the value to return */
+ *val_p = *value_ptr;
+ return true;
+}
+
+bool
+radix_tree_delete(radix_tree *tree, uint64 key)
+{
+ radix_tree_node *node;
+ int shift;
+ radix_tree_stack stack = NULL;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ radix_tree_node *child;
+ radix_tree_stack new_stack;
+
+ new_stack = (radix_tree_stack) palloc(sizeof(radix_tree_stack_data));
+ new_stack->node = node;
+ new_stack->parent = stack;
+ stack = new_stack;
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ {
+ radix_tree_free_stack(stack);
+ return false;
+ }
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+ Assert(IS_LEAF_NODE(stack->node));
+ while (stack != NULL)
+ {
+ radix_tree_node *node = stack->node;
+ Datum *slot;
+
+ stack = stack->parent;
+
+ deleted = radix_tree_node_search(node, &slot, key, RADIX_TREE_DELETE);
+
+ if (!IS_EMPTY_NODE(node))
+ break;
+
+ Assert(deleted);
+ radix_tree_free_node(tree, node);
+ }
+
+ if (deleted)
+ tree->num_keys--;
+
+ radix_tree_free_stack(stack);
+ return deleted;
+}
+
+/* Create and return the iterator for the given radix tree */
+radix_tree_iter *
+radix_tree_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ radix_tree_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (radix_tree_iter *) palloc0(sizeof(radix_tree_iter));
+ iter->tree = tree;
+
+ /* Empty tree */
+ if (!iter->tree->root)
+ {
+ MemoryContextSwitchTo(old_ctx);
+ return iter;
+ }
+
+ top_level = iter->tree->root->shift / RADIX_TREE_NODE_FANOUT;
+
+ iter->stack_len = top_level;
+ iter->stack[top_level].node = iter->tree->root;
+ iter->stack[top_level].current_idx = -1;
+
+ /* Descend to the left most leaf node from the root */
+ radix_tree_update_iter_stack(iter, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+radix_tree_iterate_next(radix_tree_iter *iter, uint64 *key_p, Datum *value_p)
+{
+ bool found = false;
+ Datum slot = (Datum) 0;
+ int level;
+
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ radix_tree_node *node;
+ radix_tree_iter_node_data *node_iter;
+
+ /*
+ * Iterate node at each level from the bottom of the tree until we search
+ * the next slot.
+ */
+ for (level = 0; level <= iter->stack_len; level++)
+ {
+ slot = radix_tree_node_iterate_next(iter, &(iter->stack[level]), &found);
+
+ if (found)
+ break;
+ }
+
+ /* end of iteration */
+ if (!found)
+ return false;
+
+ /* found the next slot at the leaf node, return it */
+ if (level == 0)
+ {
+ *key_p = iter->key;
+ *value_p = slot;
+ return true;
+ }
+
+ /*
+ * We have advanced at an upper (internal) level, so we need to update the
+ * stack by descending to the leftmost leaf node from this level.
+ */
+ node = (radix_tree_node *) DatumGetPointer(slot);
+ node_iter = &(iter->stack[level - 1]);
+ radix_tree_store_iter_node(iter, node_iter, node);
+
+ radix_tree_update_iter_stack(iter, level - 1);
+ }
+}
+
+void
+radix_tree_end_iterate(radix_tree_iter *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Update the part of the key being constructed during the iteration with the
+ * given chunk
+ */
+static inline void
+radix_tree_iter_update_key(radix_tree_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RADIX_TREE_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the iteration within the given radix tree node and return its next
+ * slot, setting *found_p to true if one exists. Otherwise, set *found_p to false.
+ */
+static Datum
+radix_tree_node_iterate_next(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ bool *found_p)
+{
+ radix_tree_node *node = node_iter->node;
+ Datum slot = (Datum) 0;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n4->n.count)
+ goto not_found;
+
+ slot = n4->slots[node_iter->current_idx];
+
+ /* Update the part of the key with the current chunk */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, n4->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n32->n.count)
+ goto not_found;
+
+ slot = n32->slots[node_iter->current_idx];
+
+ /* Update the part of the key with the current chunk */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, n32->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_128_is_chunk_used(n128, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = node_128_get_chunk_slot(n128, i);
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = n256->slots[i];
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ }
+
+ *found_p = true;
+ return slot;
+
+not_found:
+ *found_p = false;
+ return (Datum) 0;
+}
+
+/*
+ * Initialize and update the node iteration struct with the given radix tree node.
+ * This function also updates the part of the key with the chunk of the given node.
+ */
+static void
+radix_tree_store_iter_node(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ radix_tree_node *node)
+{
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ radix_tree_iter_update_key(iter, node->chunk, node->shift + RADIX_TREE_NODE_FANOUT);
+}
+
+/*
+ * Build the stack of the radix tree node while descending to the leaf from the 'from'
+ * level.
+ */
+static void
+radix_tree_update_iter_stack(radix_tree_iter *iter, int from)
+{
+ radix_tree_node *node = iter->stack[from].node;
+ int level = from;
+
+ for (;;)
+ {
+ radix_tree_iter_node_data *node_iter = &(iter->stack[level--]);
+ bool found;
+
+ /* Set the current node */
+ radix_tree_store_iter_node(iter, node_iter, node);
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ node = (radix_tree_node *)
+ DatumGetPointer(radix_tree_node_iterate_next(iter, node_iter, &found));
+
+ /*
+ * Since we always fetch the first slot in the node, a slot must be
+ * found here.
+ */
+ Assert(found);
+ }
+}
+
+uint64
+radix_tree_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+uint64
+radix_tree_memory_usage(radix_tree *tree)
+{
+ return tree->mem_used;
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RADIX_TREE_DEBUG
+void
+radix_tree_stats(radix_tree *tree)
+{
+ fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u(%lu), n32 = %u(%lu), n128 = %u(%lu), n256 = %u(%lu)",
+ tree->num_keys,
+ tree->root->shift / RADIX_TREE_NODE_FANOUT,
+ tree->cnt[0], tree->cnt[0] * sizeof(radix_tree_node_4),
+ tree->cnt[1], tree->cnt[1] * sizeof(radix_tree_node_32),
+ tree->cnt[2], tree->cnt[2] * sizeof(radix_tree_node_128),
+ tree->cnt[3], tree->cnt[3] * sizeof(radix_tree_node_256));
+ //radix_tree_dump(tree);
+}
+
+static void
+radix_tree_print_slot(StringInfo buf, uint8 chunk, Datum slot, int idx, bool is_leaf, int level)
+{
+ char space[128] = {0};
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ if (is_leaf)
+ appendStringInfo(buf, "%s[%d] \"0x%X\" val(0x%lX) LEAF\n",
+ space,
+ idx,
+ chunk,
+ DatumGetInt64(slot));
+ else
+ appendStringInfo(buf , "%s[%d] \"0x%X\" -> ",
+ space,
+ idx,
+ chunk);
+}
+
+static void
+radix_tree_dump_node(radix_tree_node *node, int level, StringInfo buf, bool recurse)
+{
+ bool is_leaf = IS_LEAF_NODE(node);
+
+ appendStringInfo(buf, "[\"%s\" type %d, cnt %u, shift %u, chunk \"0x%X\"] chunks:\n",
+ IS_LEAF_NODE(node) ? "LEAF" : "INNR",
+ (node->kind == RADIX_TREE_NODE_KIND_4) ? 4 :
+ (node->kind == RADIX_TREE_NODE_KIND_32) ? 32 :
+ (node->kind == RADIX_TREE_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ radix_tree_print_slot(buf, n4->chunks[i], n4->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n4->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+
+ for (int i = 0; i < n32->n.count; i++)
+ {
+ radix_tree_print_slot(buf, n32->chunks[i], n32->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n32->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ for (int j = 0; j < 256; j++)
+ {
+ if (!node_128_is_chunk_used(n128, j))
+ continue;
+
+ appendStringInfo(buf, "slot_idxs[%d]=%d, ", j, n128->slot_idxs[j]);
+ }
+ appendStringInfo(buf, "\nisset-bitmap:");
+ for (int j = 0; j < 16; j++)
+ {
+ appendStringInfo(buf, "%X ", (uint8) n128->isset[j]);
+ }
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ radix_tree_print_slot(buf, i, node_128_get_chunk_slot(n128, i),
+ i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) node_128_get_chunk_slot(n128, i),
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_256_is_chunk_used(n256, i))
+ continue;
+
+ radix_tree_print_slot(buf, i, n256->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n256->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+radix_tree_dump_search(radix_tree *tree, uint64 key)
+{
+ StringInfoData buf;
+ radix_tree_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ radix_tree_node *child;
+
+ radix_tree_dump_node(node, level, &buf, false);
+
+ if (IS_LEAF_NODE(node))
+ {
+ Datum *dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ radix_tree_node_search(node, &dummy, key, RADIX_TREE_FIND);
+
+ break;
+ }
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ break;
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ level++;
+ }
+
+ elog(NOTICE, "\n%s", buf.data);
+}
+
+void
+radix_tree_dump(radix_tree *tree)
+{
+ StringInfoData buf;
+
+ initStringInfo(&buf);
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu", tree->max_val);
+ radix_tree_dump_node(tree->root, 0, &buf, true);
+ elog(NOTICE, "\n%s", buf.data);
+ elog(NOTICE, "-----------------------------------------------------------");
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..c072f8ea98
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RADIX_TREE_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct radix_tree_iter radix_tree_iter;
+
+extern radix_tree *radix_tree_create(MemoryContext ctx);
+extern bool radix_tree_search(radix_tree *tree, uint64 key, Datum *val_p);
+extern void radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p);
+extern bool radix_tree_delete(radix_tree *tree, uint64 key);
+extern void radix_tree_destroy(radix_tree *tree);
+extern uint64 radix_tree_memory_usage(radix_tree *tree);
+extern uint64 radix_tree_num_entries(radix_tree *tree);
+
+extern radix_tree_iter *radix_tree_begin_iterate(radix_tree *tree);
+extern bool radix_tree_iterate_next(radix_tree_iter *iter, uint64 *key_p, Datum *value_p);
+extern void radix_tree_end_iterate(radix_tree_iter *iter);
+
+
+#ifdef RADIX_TREE_DEBUG
+extern void radix_tree_dump(radix_tree *tree);
+extern void radix_tree_dump_search(radix_tree *tree, uint64 key);
+extern void radix_tree_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9090226daa..51b2514faf 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -24,6 +24,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'radix_tree_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_integerset.out b/src/test/modules/test_radixtree/expected/test_integerset.out
new file mode 100644
index 0000000000..822dd031e9
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_integerset.out
@@ -0,0 +1,31 @@
+CREATE EXTENSION test_integerset;
+--
+-- All the logic is in the test_integerset() function. It will throw
+-- an error if something fails.
+--
+SELECT test_integerset();
+NOTICE: testing intset with empty set
+NOTICE: testing intset with distances > 2^60 between values
+NOTICE: testing intset with single value 0
+NOTICE: testing intset with single value 1
+NOTICE: testing intset with single value 18446744073709551614
+NOTICE: testing intset with single value 18446744073709551615
+NOTICE: testing intset with value 0, and all between 1000 and 2000
+NOTICE: testing intset with value 1, and all between 1000 and 2000
+NOTICE: testing intset with value 1, and all between 1000 and 2000000
+NOTICE: testing intset with value 18446744073709551614, and all between 1000 and 2000
+NOTICE: testing intset with value 18446744073709551615, and all between 1000 and 2000
+NOTICE: testing intset with pattern "all ones"
+NOTICE: testing intset with pattern "alternating bits"
+NOTICE: testing intset with pattern "clusters of ten"
+NOTICE: testing intset with pattern "clusters of hundred"
+NOTICE: testing intset with pattern "one-every-64k"
+NOTICE: testing intset with pattern "sparse"
+NOTICE: testing intset with pattern "single values, distance > 2^32"
+NOTICE: testing intset with pattern "clusters, distance > 2^32"
+NOTICE: testing intset with pattern "clusters, distance > 2^60"
+ test_integerset
+-----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..0c96ebc739
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,20 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..e93c7f6676
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,446 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test the radix tree data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool radix_tree_test_stats = true;
+
+static int radix_tree_node_max_entries[] = {4, 32, 128, 256};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void test_empty(void);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ Datum dummy;
+
+ radixtree = radix_tree_create(CurrentMemoryContext);
+
+ if (radix_tree_search(radixtree, 0, &dummy))
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ if (radix_tree_search(radixtree, 1, &dummy))
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ if (radix_tree_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ if (radix_tree_num_entries(radixtree) != 0)
+ elog(ERROR, "radix_tree_num_entries on empty tree return non-zero");
+
+ radix_tree_destroy(radixtree);
+}
+
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ Datum val;
+
+ if (!radix_tree_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (DatumGetUInt64(val) != key)
+ elog(ERROR, "radix_tree_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, DatumGetUInt64(val), key);
+ }
+}
+
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+ uint64 num_entries;
+
+ radixtree = radix_tree_create(CurrentMemoryContext);
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ radix_tree_insert(radixtree, key, Int64GetDatum(key), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(radix_tree_node_max_entries); j++)
+ {
+ if (i == (radix_tree_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : radix_tree_node_max_entries[j - 1],
+ radix_tree_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = radix_tree_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "radix_tree_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+ radix_tree *radixtree;
+ radix_tree_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (radix_tree_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = radix_tree_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ radix_tree_insert(radixtree, x, Int64GetDatum(x), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (radix_tree_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by radix_tree_memory_usage(), as well as the
+ * stats from the memory context. They should be in the same ballpark,
+ * but it's hard to automate testing that, so if you're making changes to
+ * the implementation, just observe that manually.
+ */
+ if (radix_tree_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by radix_tree_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = radix_tree_memory_usage(radixtree);
+ fprintf(stderr, "radix_tree_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that radix_tree_num_entries works */
+ n = radix_tree_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "radix_tree_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with radix_tree_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ Datum v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to radix_tree_search() ? */
+ found = radix_tree_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (DatumGetUInt64(v) != x))
+ {
+ radix_tree_dump_search(radixtree, x);
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ DatumGetUInt64(v), x);
+ }
+ }
+ endtime = GetCurrentTimestamp();
+ if (radix_tree_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = radix_tree_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!radix_tree_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ if (DatumGetUInt64(val) != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (radix_tree_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with radix_tree_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = radix_tree_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ Datum v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to radix_tree_search() ? */
+ found = radix_tree_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!radix_tree_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (radix_tree_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (radix_tree_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (radix_tree_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = radix_tree_num_entries(radixtree);
+
+ /* Check that radix_tree_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "radix_tree_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
On Wed, May 25, 2022 at 11:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Tue, May 10, 2022 at 6:58 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > On Tue, May 10, 2022 at 8:52 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > Overall, radix tree implementations have good numbers. Once we got an
> > > agreement on moving in this direction, I'll start a new thread for
> > > that and move the implementation further; there are many things to do
> > > and discuss: deletion, API design, SIMD support, more tests etc.
> >
> > +1
>
> Thanks!
>
> I've attached an updated version patch. It is still WIP but I've
> implemented deletion and improved test cases and comments.
I've attached an updated version of the patch that changes the configure
script. I'm still studying how to support AVX2 in the MSVC build. Also,
I added more regression tests.

The integration with lazy vacuum and parallel vacuum is still missing.
In order to support parallel vacuum, the radix tree needs to support
being created on a DSA area.
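As a very rough sketch of that direction (the function names here are
purely hypothetical, nothing is implemented yet), I imagine something
like radix_tree_create_dsa(dsa_area *area) returning a handle that the
parallel vacuum leader can pass to workers, plus a corresponding attach
function, so that all processes can probe the same tree.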
Added this item to the next CF.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachments:
radixtree_wip_v3.patch (application/x-patch)
diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index d3562d6fee..a56d6e89da 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -676,3 +676,27 @@ if test x"$Ac_cachevar" = x"yes"; then
fi
undefine([Ac_cachevar])dnl
])# PGAC_ARMV8_CRC32C_INTRINSICS
+
+# PGAC_AVX2_INTRINSICS
+# --------------------
+# Check if the compiler supports the Intel AVX2 instructions.
+#
+# If the intrinsics are supported, sets pgac_avx2_intrinsics, and CFLAGS_AVX2.
+AC_DEFUN([PGAC_AVX2_INTRINSICS],
+[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx2_intrinsics_$1])])dnl
+AC_CACHE_CHECK([for _mm256_set1_epi8 _mm256_cmpeq_epi8 _mm256_movemask_epi8 CFLAGS=$1], [Ac_cachevar],
+[pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS $1"
+AC_LINK_IFELSE([AC_LANG_PROGRAM([#include <immintrin.h>],
+ [__m256i vec = _mm256_set1_epi8(0);
+ __m256i cmp = _mm256_cmpeq_epi8(vec, vec);
+ return _mm256_movemask_epi8(cmp) > 0;])],
+ [Ac_cachevar=yes],
+ [Ac_cachevar=no])
+CFLAGS="$pgac_save_CFLAGS"])
+if test x"$Ac_cachevar" = x"yes"; then
+ CFLAGS_AVX2="$1"
+ pgac_avx2_intrinsics=yes
+fi
+undefine([Ac_cachevar])dnl
+])# PGAC_AVX2_INTRINSICS
diff --git a/configure b/configure
index 7dec6b7bf9..6ebc15a8c1 100755
--- a/configure
+++ b/configure
@@ -645,6 +645,7 @@ XGETTEXT
MSGMERGE
MSGFMT_FLAGS
MSGFMT
+CFLAGS_AVX2
PG_CRC32C_OBJS
CFLAGS_ARMV8_CRC32C
CFLAGS_SSE42
@@ -18829,6 +18830,82 @@ $as_echo "slicing-by-8" >&6; }
fi
+# Check for Intel AVX2 intrinsics.
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm256i CFLAGS=" >&5
+$as_echo_n "checking for _mm256i CFLAGS=... " >&6; }
+if ${pgac_cv_avx2_intrinsics_+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS "
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+__m256i vec = _mm256_set1_epi8(0);
+ __m256i cmp = _mm256_cmpeq_epi8(vec, vec);
+ return _mm256_movemask_epi8(cmp) > 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx2_intrinsics_=yes
+else
+ pgac_cv_avx2_intrinsics_=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx2_intrinsics_" >&5
+$as_echo "$pgac_cv_avx2_intrinsics_" >&6; }
+if test x"$pgac_cv_avx2_intrinsics_" = x"yes"; then
+ CFLAGS_AVX2=""
+ pgac_avx2_intrinsics=yes
+fi
+
+if test x"pgac_avx2_intrinsics" != x"yes"; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _mm256i CFLAGS=-mavx2" >&5
+$as_echo_n "checking for _mm256i CFLAGS=-mavx2... " >&6; }
+if ${pgac_cv_avx2_intrinsics__mavx2+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ pgac_save_CFLAGS=$CFLAGS
+CFLAGS="$pgac_save_CFLAGS -mavx2"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#include <immintrin.h>
+int
+main ()
+{
+__m256i vec = _mm256_set1_epi8(0);
+ __m256i cmp = _mm256_cmpeq_epi8(vec, vec);
+ return _mm256_movemask_epi8(cmp) > 0;
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ pgac_cv_avx2_intrinsics__mavx2=yes
+else
+ pgac_cv_avx2_intrinsics__mavx2=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+CFLAGS="$pgac_save_CFLAGS"
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx2_intrinsics__mavx2" >&5
+$as_echo "$pgac_cv_avx2_intrinsics__mavx2" >&6; }
+if test x"$pgac_cv_avx2_intrinsics__mavx2" = x"yes"; then
+ CFLAGS_AVX2="-mavx2"
+ pgac_avx2_intrinsics=yes
+fi
+
+fi
+
# Select semaphore implementation type.
if test "$PORTNAME" != "win32"; then
diff --git a/configure.ac b/configure.ac
index d093fb88dd..6b6d095306 100644
--- a/configure.ac
+++ b/configure.ac
@@ -2300,6 +2300,12 @@ else
fi
AC_SUBST(PG_CRC32C_OBJS)
+# Check for Intel AVX2 intrinsics.
+PGAC_AVX2_INTRINSICS([])
+if test x"pgac_avx2_intrinsics" != x"yes"; then
+ PGAC_AVX2_INTRINSICS([-mavx2])
+fi
+AC_SUBST(CFLAGS_AVX2)
# Select semaphore implementation type.
if test "$PORTNAME" != "win32"; then
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 051718e4fe..9717094724 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -263,6 +263,7 @@ CFLAGS_UNROLL_LOOPS = @CFLAGS_UNROLL_LOOPS@
CFLAGS_VECTORIZE = @CFLAGS_VECTORIZE@
CFLAGS_SSE42 = @CFLAGS_SSE42@
CFLAGS_ARMV8_CRC32C = @CFLAGS_ARMV8_CRC32C@
+CFLAGS_AVX2 = @CFLAGS_AVX2@
PERMIT_DECLARATION_AFTER_STATEMENT = @PERMIT_DECLARATION_AFTER_STATEMENT@
CXXFLAGS = @CXXFLAGS@
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..5e4516ca90 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,10 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
+# radixtree.o needs CFLAGS_AVX2
+radixtree.o: CFLAGS+=$(CFLAGS_AVX2)
+
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..bf87f932fd
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,1763 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation of an adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instructions, enabling us to use 256-bit
+ * wide SIMD vectors, whereas 128-bit wide SIMD vectors are used in the paper.
+ * Also, there is no support for path compression and lazy path expansion. The
+ * radix tree supports only fixed-length keys, so we don't expect the tree to
+ * become very high.
+ *
+ * The key is a 64-bit unsigned integer and the value is a Datum. Both internal
+ * nodes and leaf nodes have an identical structure. Internal tree nodes
+ * (shift > 0) store pointers to their child nodes as values, whereas leaf
+ * nodes (shift == 0) store the Datum values specified by the user.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * radix_tree_create - Create a new, empty radix tree
+ * radix_tree_free - Free the radix tree
+ * radix_tree_insert - Insert a key-value pair
+ * radix_tree_delete - Delete a key-value pair
+ * radix_tree_begin_iterate - Begin iterating through all key-value pairs
+ * radix_tree_iterate_next - Return next key-value pair, if any
+ * radix_tree_end_iterate - End iteration
+ *
+ * radix_tree_create() creates an empty radix tree in the given memory context,
+ * along with child memory contexts for each kind of radix tree node.
+ *
+ * radix_tree_iterate_next() returns the key-value pairs in ascending key
+ * order.
+ *
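+ * A minimal usage sketch (within a caller-provided memory context; error
+ * handling and iteration are omitted):
+ *
+ *    radix_tree *tree = radix_tree_create(CurrentMemoryContext);
+ *    uint64  key = 42;
+ *    Datum   val;
+ *    bool    found;
+ *
+ *    radix_tree_insert(tree, key, Int64GetDatum(123), &found);
+ *    if (radix_tree_search(tree, key, &val))
+ *        Assert(DatumGetInt64(val) == 123);
+ *    radix_tree_free(tree);
+ *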
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+#if defined(__AVX2__)
+#include <immintrin.h> /* AVX2 intrinsics */
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RADIX_TREE_NODE_FANOUT 8
+
+/* The number of maximum slots in the node, used in node-256 */
+#define RADIX_TREE_NODE_MAX_SLOTS (1 << RADIX_TREE_NODE_FANOUT)
+
+/*
+ * Return the number of bytes in the is-set bitmap required to cover nslots
+ * slots, used in node-128 and node-256.
+ */
+#define RADIX_TREE_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RADIX_TREE_CHUNK_MASK ((1 << RADIX_TREE_NODE_FANOUT) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RADIX_TREE_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RADIX_TREE_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RADIX_TREE_NODE_FANOUT)
+
+/* Get a chunk from the key */
+#define GET_KEY_CHUNK(key, shift) \
+ ((uint8) (((key) >> (shift)) & RADIX_TREE_CHUNK_MASK))
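+
+/*
+ * For example, GET_KEY_CHUNK(0x1234, 0) is 0x34 and GET_KEY_CHUNK(0x1234, 8)
+ * is 0x12; chunks at all higher shifts are zero.
+ */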
+
+/* Map a slot/chunk number to its byte and bit in the is-set bitmap of node-128 and node-256 */
+#define NODE_BITMAP_BYTE(v) ((v) / RADIX_TREE_NODE_FANOUT)
+#define NODE_BITMAP_BIT(v) (UINT64_C(1) << ((v) % RADIX_TREE_NODE_FANOUT))
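+
+/* For example, slot 9 maps to bit 0x02 of isset[1]. */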
+
+/* Enum used by radix_tree_node_search() */
+typedef enum
+{
+ RADIX_TREE_FIND = 0, /* find the key-value */
+ RADIX_TREE_DELETE, /* delete the key-value */
+} radix_tree_action;
+
+/*
+ * Supported radix tree node kinds.
+ *
+ * XXX: should we add KIND_16 as we can utilize SSE2 SIMD instructions?
+ */
+typedef enum radix_tree_node_kind
+{
+ RADIX_TREE_NODE_KIND_4 = 0,
+ RADIX_TREE_NODE_KIND_32,
+ RADIX_TREE_NODE_KIND_128,
+ RADIX_TREE_NODE_KIND_256
+} radix_tree_node_kind;
+#define RADIX_TREE_NODE_KIND_COUNT 4
+
+/*
+ * Base type for all nodes types.
+ */
+typedef struct radix_tree_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at a fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RADIX_TREE_NODE_FANOUT bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size class of the node */
+ radix_tree_node_kind kind;
+} radix_tree_node;
+
+/* Macros for radix tree nodes */
+#define IS_LEAF_NODE(n) (((radix_tree_node *) (n))->shift == 0)
+#define IS_EMPTY_NODE(n) (((radix_tree_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((radix_tree_node *) (n))->count < \
+ radix_tree_node_info[((radix_tree_node *) (n))->kind].max_slots)
+
+/*
+ * To reduce memory usage compared to a simple radix tree with a fixed fanout
+ * we use adaptive node sizes, with different storage methods for different
+ * numbers of elements.
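+ * For example, a node with at most four children is stored as a node-4; once
+ * a fifth child is inserted, the node is grown to a node-32, and so on up to
+ * node-256.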
+ */
+typedef struct radix_tree_node_4
+{
+ radix_tree_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+ Datum slots[4];
+} radix_tree_node_4;
+
+typedef struct radix_tree_node_32
+{
+ radix_tree_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+ Datum slots[32];
+} radix_tree_node_32;
+
+#define RADIX_TREE_NODE_128_BITS RADIX_TREE_NODE_NSLOTS_BITS(128)
+typedef struct radix_tree_node_128
+{
+ radix_tree_node n;
+
+ /*
+ * The index into the slots array for each chunk. 0 means unused, whereas the
+ * slots array is 0-indexed. So the slot for chunk C is slots[slot_idxs[C] - 1].
+ */
+ uint8 slot_idxs[RADIX_TREE_NODE_MAX_SLOTS];
+
+ /* A bitmap to track which slot is in use */
+ uint8 isset[RADIX_TREE_NODE_128_BITS];
+
+ Datum slots[128];
+} radix_tree_node_128;
+
+#define RADIX_TREE_NODE_MAX_BITS RADIX_TREE_NODE_NSLOTS_BITS(RADIX_TREE_NODE_MAX_SLOTS)
+typedef struct radix_tree_node_256
+{
+ radix_tree_node n;
+
+ /* A bitmap to track which slot is in use */
+ uint8 isset[RADIX_TREE_NODE_MAX_BITS];
+
+ Datum slots[RADIX_TREE_NODE_MAX_SLOTS];
+} radix_tree_node_256;
+
+/* Information of each size class */
+typedef struct radix_tree_node_info_elem
+{
+ const char *name;
+ int max_slots;
+ Size size;
+} radix_tree_node_info_elem;
+
+static radix_tree_node_info_elem radix_tree_node_info[] =
+{
+ {"radix tree node 4", 4, sizeof(radix_tree_node_4)},
+ {"radix tree node 32", 32, sizeof(radix_tree_node_32)},
+ {"radix tree node 128", 128, sizeof(radix_tree_node_128)},
+ {"radix tree node 256", 256, sizeof(radix_tree_node_256)},
+};
+
+/*
+ * As we descend the radix tree, we push each visited node onto a stack. The
+ * stack is used during deletion.
+ */
+typedef struct radix_tree_stack_data
+{
+ radix_tree_node *node;
+ struct radix_tree_stack_data *parent;
+} radix_tree_stack_data;
+typedef radix_tree_stack_data *radix_tree_stack;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each key-value pair in ascending key order.
+ * To support this, we iterate over the nodes at each level. The
+ * radix_tree_iter_node_data struct is used to track the iteration within a node.
+ * radix_tree_iter has an array of this struct, 'stack', to track the iteration
+ * at every level. During the iteration, we also construct the key to return. The key
+ * is updated whenever we update the node iteration information, e.g., when advancing
+ * the current index within the node or when moving to the next node at the same level.
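+ *
+ * A typical iteration loop looks like this (sketch):
+ *
+ *    radix_tree_iter *iter = radix_tree_begin_iterate(tree);
+ *    uint64  key;
+ *    Datum   val;
+ *
+ *    while (radix_tree_iterate_next(iter, &key, &val))
+ *        ... process key and val ...
+ *    radix_tree_end_iterate(iter);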
+ */
+typedef struct radix_tree_iter_node_data
+{
+ radix_tree_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} radix_tree_iter_node_data;
+
+struct radix_tree_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ radix_tree_iter_node_data stack[RADIX_TREE_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* The radix tree itself */
+struct radix_tree
+{
+ MemoryContext context;
+
+ radix_tree_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+ MemoryContextData *slabs[RADIX_TREE_NODE_KIND_COUNT];
+
+ /* statistics */
+ uint64 mem_used;
+ int32 cnt[RADIX_TREE_NODE_KIND_COUNT];
+};
+
+static radix_tree_node *radix_tree_node_grow(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key);
+static bool radix_tree_node_search_child(radix_tree_node *node, radix_tree_node **child_p,
+ uint64 key);
+static bool radix_tree_node_search(radix_tree_node *node, Datum **slot_p, uint64 key,
+ radix_tree_action action);
+static void radix_tree_extend(radix_tree *tree, uint64 key);
+static void radix_tree_new_root(radix_tree *tree, uint64 key, Datum val);
+static radix_tree_node *radix_tree_node_insert_child(radix_tree *tree,
+ radix_tree_node *parent,
+ radix_tree_node *node,
+ uint64 key);
+static void radix_tree_node_insert_val(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key, Datum val,
+ bool *replaced_p);
+static inline void radix_tree_iter_update_key(radix_tree_iter *iter, uint8 chunk, uint8 shift);
+static Datum radix_tree_node_iterate_next(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ bool *found_p);
+static void radix_tree_store_iter_node(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ radix_tree_node *node);
+static void radix_tree_update_iter_stack(radix_tree_iter *iter, int from);
+static void radix_tree_verify_node(radix_tree_node *node);
+
+/*
+ * Helper functions for accessing each kind of nodes.
+ */
+static inline int
+node_32_search_eq(radix_tree_node_32 *node, uint8 chunk)
+{
+#ifdef __AVX2__
+ __m256i _key = _mm256_set1_epi8(chunk);
+ __m256i _data = _mm256_loadu_si256((__m256i_u *) node->chunks);
+ __m256i _cmp = _mm256_cmpeq_epi8(_key, _data);
+ uint32 bitfield = _mm256_movemask_epi8(_cmp);
+
+ bitfield &= ((UINT64_C(1) << node->n.count) - 1);
+
+ return (bitfield) ? __builtin_ctz(bitfield) : -1;
+
+#else
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] > chunk)
+ return -1;
+
+ if (node->chunks[i] == chunk)
+ return i;
+ }
+
+ return -1;
+#endif /* __AVX2__ */
+}
+
+/*
+ * This is a bit more complicated than search_chunk_array_16_eq(), because
+ * until recently no unsigned uint8 comparison instruction existed on x86. So
+ * we need to play some trickery using _mm_min_epu8() to effectively get
+ * <=. There never will be any equal elements in the current uses, but that's
+ * what we get here...
+ */
+static inline int
+node_32_search_le(radix_tree_node_32 *node, uint8 chunk)
+{
+#ifdef __AVX2__
+ __m256i _key = _mm256_set1_epi8(chunk);
+ __m256i _data = _mm256_loadu_si256((__m256i_u *) node->chunks);
+ __m256i _min = _mm256_min_epu8(_key, _data);
+ __m256i cmp = _mm256_cmpeq_epi8(_key, _min);
+ uint32_t bitfield = _mm256_movemask_epi8(cmp);
+
+ bitfield &= ((UINT64_C(1) << node->n.count) - 1);
+
+ return (bitfield) ? __builtin_ctz(bitfield) : node->n.count;
+#else
+ int index;
+
+ for (index = 0; index < node->n.count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+
+ return index;
+#endif /* __AVX2__ */
+}
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(radix_tree_node_128 *node, uint8 chunk)
+{
+ return (node->slot_idxs[chunk] != 0);
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_128_is_slot_used(radix_tree_node_128 *node, uint8 slot)
+{
+ return ((node->isset[NODE_BITMAP_BYTE(slot)] & NODE_BITMAP_BIT(slot)) != 0);
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_128_set(radix_tree_node_128 *node, uint8 chunk, Datum val)
+{
+ int slotpos = 0;
+
+ /* Search an unused slot */
+ while (node_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ node->slot_idxs[chunk] = slotpos + 1;
+ node->slots[slotpos] = val;
+ node->isset[NODE_BITMAP_BYTE(slotpos)] |= NODE_BITMAP_BIT(slotpos);
+}
+
+/* Delete the slot at the corresponding chunk */
+static inline void
+node_128_unset(radix_tree_node_128 *node, uint8 chunk)
+{
+ int slotpos = node->slot_idxs[chunk] - 1;
+
+ if (!node_128_is_chunk_used(node, chunk))
+ return;
+
+ node->isset[NODE_BITMAP_BYTE(slotpos)] &= ~(NODE_BITMAP_BIT(slotpos));
+ node->slot_idxs[chunk] = 0;
+}
+
+/* Return the slot data corresponding to the chunk */
+static inline Datum
+node_128_get_chunk_slot(radix_tree_node_128 *node, uint8 chunk)
+{
+ return node->slots[node->slot_idxs[chunk] - 1];
+}
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_256_is_chunk_used(radix_tree_node_256 *node, uint8 chunk)
+{
+ return (node->isset[NODE_BITMAP_BYTE(chunk)] & NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+node_256_set(radix_tree_node_256 *node, uint8 chunk, Datum slot)
+{
+ node->slots[chunk] = slot;
+ node->isset[NODE_BITMAP_BYTE(chunk)] |= NODE_BITMAP_BIT(chunk);
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+node_256_unset(radix_tree_node_256 *node, uint8 chunk)
+{
+ node->isset[NODE_BITMAP_BYTE(chunk)] &= ~(NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key. For example,
+ * key_get_shift(0x1234) returns 8.
+ */
+inline static int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RADIX_TREE_NODE_FANOUT) * RADIX_TREE_NODE_FANOUT;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
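+ * For example, shift_get_max_val(8) is 0xFFFF, i.e., a node with shift 8
+ * covers 16 bits of the key space.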
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RADIX_TREE_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64_C(1) << (shift + RADIX_TREE_NODE_FANOUT)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static radix_tree_node *
+radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
+{
+ radix_tree_node *newnode;
+
+ newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
+ radix_tree_node_info[kind].size);
+ newnode->kind = kind;
+
+ /* update the statistics */
+ tree->mem_used += GetMemoryChunkSpace(newnode);
+ tree->cnt[kind]++;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+radix_tree_free_node(radix_tree *tree, radix_tree_node *node)
+{
+ /*
+ * XXX: If we're deleting the root node, make the tree empty
+ */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ }
+
+ /* update the statistics */
+ tree->mem_used -= GetMemoryChunkSpace(node);
+ tree->cnt[node->kind]--;
+
+ Assert(tree->mem_used >= 0);
+ Assert(tree->cnt[node->kind] >= 0);
+
+ pfree(node);
+}
+
+/* Free a stack made by radix_tree_delete */
+static void
+radix_tree_free_stack(radix_tree_stack stack)
+{
+ radix_tree_stack ostack;
+
+ while (stack != NULL)
+ {
+ ostack = stack;
+ stack = stack->parent;
+ pfree(ostack);
+ }
+}
+
+/* Copy the common fields without the kind */
+static void
+radix_tree_copy_node_common(radix_tree_node *src, radix_tree_node *dst)
+{
+ dst->shift = src->shift;
+ dst->chunk = src->chunk;
+ dst->count = src->count;
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it
+ * can store the key. For example, if the root has shift 8 (max key 0xFFFF)
+ * and the new key is 0x1000000, we add new root nodes with shifts 16 and 24.
+ */
+static void
+radix_tree_extend(radix_tree *tree, uint64 key)
+{
+ int max_shift;
+ int shift = tree->root->shift + RADIX_TREE_NODE_FANOUT;
+
+ max_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'max_shift' */
+ while (shift <= max_shift)
+ {
+ radix_tree_node_4 *node =
+ (radix_tree_node_4 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+
+ node->n.count = 1;
+ node->n.shift = shift;
+ node->chunks[0] = 0;
+ node->slots[0] = PointerGetDatum(tree->root);
+
+ tree->root->chunk = 0;
+ tree->root = (radix_tree_node *) node;
+
+ shift += RADIX_TREE_NODE_FANOUT;
+ }
+
+ tree->max_val = shift_get_max_val(max_shift);
+}
+
+/*
+ * Wrapper around radix_tree_node_search() to search for the pointer to a child
+ * node within the node.
+ *
+ * Return true if the corresponding child is found, otherwise return false. On success,
+ * it sets child_p.
+ */
+static bool
+radix_tree_node_search_child(radix_tree_node *node, radix_tree_node **child_p, uint64 key)
+{
+ bool found = false;
+ Datum *slot_ptr;
+
+ if (radix_tree_node_search(node, &slot_ptr, key, RADIX_TREE_FIND))
+ {
+ /* Found the pointer to the child node */
+ found = true;
+ *child_p = (radix_tree_node *) DatumGetPointer(*slot_ptr);
+ }
+
+ return found;
+}
+
+/*
+ * Return true if the corresponding slot is used, otherwise return false. On
+ * success with RADIX_TREE_FIND, sets *slot_p to point to the slot.
+ */
+static bool
+radix_tree_node_search(radix_tree_node *node, Datum **slot_p, uint64 key,
+ radix_tree_action action)
+{
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+
+ /* Do linear search */
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ if (n4->chunks[i] > chunk)
+ break;
+
+ /*
+ * If we find the chunk in the node, do the specified
+ * action
+ */
+ if (n4->chunks[i] == chunk)
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n4->slots[i]);
+ else /* RADIX_TREE_DELETE */
+ {
+ memmove(&(n4->chunks[i]), &(n4->chunks[i + 1]),
+ sizeof(uint8) * (n4->n.count - i - 1));
+ memmove(&(n4->slots[i]), &(n4->slots[i + 1]),
+ sizeof(radix_tree_node *) * (n4->n.count - i - 1));
+ }
+
+ found = true;
+ break;
+ }
+ }
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ int idx;
+
+ /* Search by SIMD instructions */
+ idx = node_32_search_eq(n32, chunk);
+
+ /* If we find the chunk in the node, do the specified action */
+ if (idx >= 0)
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n32->slots[idx]);
+ else /* RADIX_TREE_DELETE */
+ {
+ memmove(&(n32->chunks[idx]), &(n32->chunks[idx + 1]),
+ sizeof(uint8) * (n32->n.count - idx - 1));
+ memmove(&(n32->slots[idx]), &(n32->slots[idx + 1]),
+ sizeof(radix_tree_node *) * (n32->n.count - idx - 1));
+ }
+
+ found = true;
+ }
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ /* If we find the chunk in the node, do the specified action */
+ if (node_128_is_chunk_used(n128, chunk))
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n128->slots[n128->slot_idxs[chunk] - 1]);
+ else /* RADIX_TREE_DELETE */
+ node_128_unset(n128, chunk);
+
+ found = true;
+ }
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ /* If we find the chunk in the node, do the specified action */
+ if (node_256_is_chunk_used(n256, chunk))
+ {
+ if (action == RADIX_TREE_FIND)
+ *slot_p = &(n256->slots[chunk]);
+ else /* RADIX_TREE_DELETE */
+ node_256_unset(n256, chunk);
+
+ found = true;
+ }
+
+ break;
+ }
+ }
+
+ /* Update the statistics */
+ if (action == RADIX_TREE_DELETE && found)
+ node->count--;
+
+ return found;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+radix_tree_new_root(radix_tree *tree, uint64 key, Datum val)
+{
+ radix_tree_node_4 *n4 =
+ (radix_tree_node_4 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+ int shift = key_get_shift(key);
+
+ n4->n.shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = (radix_tree_node *) n4;
+}
+
+/* Insert 'node' as a child node of 'parent' */
+static radix_tree_node *
+radix_tree_node_insert_child(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key)
+{
+ radix_tree_node *newchild =
+ (radix_tree_node *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_4);
+
+ Assert(!IS_LEAF_NODE(node));
+
+ newchild->shift = node->shift - RADIX_TREE_NODE_FANOUT;
+ newchild->chunk = GET_KEY_CHUNK(key, node->shift);
+
+ radix_tree_node_insert_val(tree, parent, node, key, PointerGetDatum(newchild), NULL);
+
+ return (radix_tree_node *) newchild;
+}
+
+/*
+ * Insert the value into the node. The node grows if it's full.
+ */
+static void
+radix_tree_node_insert_val(radix_tree *tree, radix_tree_node *parent,
+ radix_tree_node *node, uint64 key, Datum val,
+ bool *replaced_p)
+{
+ int chunk = GET_KEY_CHUNK(key, node->shift);
+ bool replaced = false;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+ int idx;
+
+ for (idx = 0; idx < n4->n.count; idx++)
+ {
+ if (n4->chunks[idx] >= chunk)
+ break;
+ }
+
+ if (NODE_HAS_FREE_SLOT(n4))
+ {
+ if (n4->n.count == 0)
+ {
+ /* the first key for this node, add it */
+ }
+ else if (n4->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n4->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the
+ * array, make space for the new key.
+ */
+ memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
+ sizeof(uint8) * (n4->n.count - idx));
+ memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
+ sizeof(radix_tree_node *) * (n4->n.count - idx));
+ }
+
+ n4->chunks[idx] = chunk;
+ n4->slots[idx] = val;
+
+ /* Done */
+ break;
+ }
+
+ /* The node doesn't have free slot so needs to grow */
+ node = radix_tree_node_grow(tree, parent, node, key);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_32);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ int idx;
+
+ idx = node_32_search_le(n32, chunk);
+
+ if (NODE_HAS_FREE_SLOT(n32))
+ {
+ if (n32->n.count == 0)
+ {
+ /* first key for this node, add it */
+ }
+ else if (n32->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n32->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the
+ * array, make space for the new key.
+ */
+ memmove(&(n32->chunks[idx + 1]), &(n32->chunks[idx]),
+ sizeof(uint8) * (n32->n.count - idx));
+ memmove(&(n32->slots[idx + 1]), &(n32->slots[idx]),
+ sizeof(radix_tree_node *) * (n32->n.count - idx));
+ }
+
+ n32->chunks[idx] = chunk;
+ n32->slots[idx] = val;
+
+ /* Done */
+ break;
+ }
+
+ /* The node doesn't have free slot so needs to grow */
+ node = radix_tree_node_grow(tree, parent, node, key);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_128);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ if (node_128_is_chunk_used(n128, chunk))
+ {
+ /* found the existing value */
+ node_128_set(n128, chunk, val);
+ replaced = true;
+ break;
+ }
+
+ if (NODE_HAS_FREE_SLOT(n128))
+ {
+ node_128_set(n128, chunk, val);
+
+ /* Done */
+ break;
+ }
+
+ /* The node doesn't have free slot so needs to grow */
+ node = radix_tree_node_grow(tree, parent, node, key);
+ Assert(node->kind == RADIX_TREE_NODE_KIND_256);
+ }
+ /* FALLTHROUGH */
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ if (node_256_is_chunk_used(n256, chunk))
+ replaced = true;
+
+ node_256_set(n256, chunk, val);
+
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!replaced)
+ node->count++;
+
+ if (replaced_p)
+ *replaced_p = replaced;
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or replaced
+ * properly in the node.
+ */
+ radix_tree_verify_node(node);
+}
+
+/* Change the node type to the next larger one */
+static radix_tree_node *
+radix_tree_node_grow(radix_tree *tree, radix_tree_node *parent, radix_tree_node *node,
+ uint64 key)
+{
+ radix_tree_node *newnode = NULL;
+
+ Assert(node->count == radix_tree_node_info[node->kind].max_slots);
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+ radix_tree_node_32 *new32 =
+ (radix_tree_node_32 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_32);
+
+ radix_tree_copy_node_common((radix_tree_node *) n4,
+ (radix_tree_node *) new32);
+
+ /* Copy both chunks and slots to the new node */
+ memcpy(&(new32->chunks), &(n4->chunks), sizeof(uint8) * 4);
+ memcpy(&(new32->slots), &(n4->slots), sizeof(Datum) * 4);
+
+ newnode = (radix_tree_node *) new32;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+ radix_tree_node_128 *new128 =
+ (radix_tree_node_128 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_128);
+
+ /* Copy both chunks and slots to the new node */
+ radix_tree_copy_node_common((radix_tree_node *) n32,
+ (radix_tree_node *) new128);
+
+ for (int i = 0; i < n32->n.count; i++)
+ node_128_set(new128, n32->chunks[i], n32->slots[i]);
+
+ newnode = (radix_tree_node *) new128;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+ radix_tree_node_256 *new256 =
+ (radix_tree_node_256 *) radix_tree_alloc_node(tree, RADIX_TREE_NODE_KIND_256);
+ int cnt = 0;
+
+ radix_tree_copy_node_common((radix_tree_node *) n128,
+ (radix_tree_node *) new256);
+
+ for (int i = 0; i < RADIX_TREE_NODE_MAX_SLOTS && cnt < n128->n.count; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ node_256_set(new256, i, node_128_get_chunk_slot(n128, i));
+ cnt++;
+ }
+
+ newnode = (radix_tree_node *) new256;
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ elog(ERROR, "radix tree node-256 cannot grow");
+ break;
+ }
+
+ if (parent == node)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = newnode;
+ }
+ else
+ {
+ Datum *slot_ptr = NULL;
+
+ /* Redirect from the parent to the node */
+ radix_tree_node_search(parent, &slot_ptr, key, RADIX_TREE_FIND);
+ Assert(*slot_ptr);
+ *slot_ptr = PointerGetDatum(newnode);
+ }
+
+ /* Verify the node has grown properly */
+ radix_tree_verify_node(newnode);
+
+ /* Free the old node */
+ radix_tree_free_node(tree, node);
+
+ return newnode;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+radix_tree_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->max_val = 0;
+ tree->root = NULL;
+ tree->context = ctx;
+ tree->num_keys = 0;
+ tree->mem_used = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RADIX_TREE_NODE_KIND_COUNT; i++)
+ {
+ tree->slabs[i] = SlabContextCreate(ctx,
+ radix_tree_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ radix_tree_node_info[i].size);
+ tree->cnt[i] = 0;
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+radix_tree_free(radix_tree *tree)
+{
+ for (int i = 0; i < RADIX_TREE_NODE_KIND_COUNT; i++)
+ MemoryContextDelete(tree->slabs[i]);
+
+ pfree(tree);
+}
+
+/*
+ * Insert the key with the val.
+ *
+ * If found_p is not NULL, it is set to true if the key is already present,
+ * otherwise false.
+ *
+ * XXX: do we need to support update_if_exists behavior?
+ */
+void
+radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
+{
+ int shift;
+ bool replaced;
+ radix_tree_node *node;
+ radix_tree_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ radix_tree_new_root(tree, key, val);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ radix_tree_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+ while (shift > 0)
+ {
+ radix_tree_node *child;
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ child = radix_tree_node_insert_child(tree, parent, node, key);
+
+ Assert(child != NULL);
+
+ parent = node;
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+ /* arrived at a leaf */
+ Assert(IS_LEAF_NODE(node));
+
+ radix_tree_node_insert_val(tree, parent, node, key, val, &replaced);
+
+ /* Update the statistics */
+ if (!replaced)
+ tree->num_keys++;
+
+ if (found_p)
+ *found_p = replaced;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, we store the value in *val_p, so val_p
+ * must not be NULL.
+ */
+bool
+radix_tree_search(radix_tree *tree, uint64 key, Datum *val_p)
+{
+ radix_tree_node *node;
+ Datum *value_ptr;
+ int shift;
+
+ Assert(val_p);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift > 0)
+ {
+ radix_tree_node *child;
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ return false;
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+ /* We reached a leaf node; search for the corresponding slot */
+ Assert(IS_LEAF_NODE(node));
+
+ if (!radix_tree_node_search(node, &value_ptr, key, RADIX_TREE_FIND))
+ return false;
+
+ /* Found, set the value to return */
+ *val_p = *value_ptr;
+ return true;
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+radix_tree_delete(radix_tree *tree, uint64 key)
+{
+ radix_tree_node *node;
+ int shift;
+ radix_tree_stack stack = NULL;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search for the key while building a stack of the nodes
+ * we visit.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ radix_tree_node *child;
+ radix_tree_stack new_stack;
+
+ new_stack = (radix_tree_stack) palloc(sizeof(radix_tree_stack_data));
+ new_stack->node = node;
+ new_stack->parent = stack;
+ stack = new_stack;
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ {
+ radix_tree_free_stack(stack);
+ return false;
+ }
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ }
+
+ /*
+ * Delete the key from the leaf node and recursively delete internal nodes
+ * if necessary.
+ */
+ Assert(IS_LEAF_NODE(stack->node));
+ while (stack != NULL)
+ {
+ radix_tree_node *node;
+ Datum *slot;
+
+ /* pop the node from the stack */
+ node = stack->node;
+ stack = stack->parent;
+
+ deleted = radix_tree_node_search(node, &slot, key, RADIX_TREE_DELETE);
+
+ /* If the node didn't become empty, we're done propagating the deletion */
+ if (!IS_EMPTY_NODE(node))
+ break;
+
+ Assert(deleted);
+
+ /* The node became empty */
+ radix_tree_free_node(tree, node);
+
+ /*
+ * If we eventually deleted the root node while recursively deleting
+ * empty nodes, we make the tree empty.
+ */
+ if (stack == NULL)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+ }
+
+ if (deleted)
+ tree->num_keys--;
+
+ radix_tree_free_stack(stack);
+ return deleted;
+}
+
+/* Create and return the iterator for the given radix tree */
+radix_tree_iter *
+radix_tree_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ radix_tree_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (radix_tree_iter *) palloc0(sizeof(radix_tree_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RADIX_TREE_NODE_FANOUT;
+
+ iter->stack_len = top_level;
+ iter->stack[top_level].node = iter->tree->root;
+ iter->stack[top_level].current_idx = -1;
+
+ /* Descend to the left most leaf node from the root */
+ radix_tree_update_iter_stack(iter, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * If there is a next key, return true and set *key_p and *value_p. Otherwise,
+ * return false.
+ */
+bool
+radix_tree_iterate_next(radix_tree_iter *iter, uint64 *key_p, Datum *value_p)
+{
+ bool found = false;
+ Datum slot = (Datum) 0;
+ int level;
+
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ radix_tree_node *node;
+ radix_tree_iter_node_data *node_iter;
+
+ /*
+ * Iterate over the nodes at each level, from the bottom of the tree
+ * up, until we find the next slot.
+ */
+ for (level = 0; level <= iter->stack_len; level++)
+ {
+ slot = radix_tree_node_iterate_next(iter, &(iter->stack[level]), &found);
+
+ if (found)
+ break;
+ }
+
+ /* end of iteration */
+ if (!found)
+ return false;
+
+ /* found the next slot at the leaf node, return it */
+ if (level == 0)
+ {
+ *key_p = iter->key;
+ *value_p = slot;
+ return true;
+ }
+
+ /*
+ * We have advanced past one or more internal nodes, so we need to
+ * update the stack by descending to the leftmost leaf node from this
+ * level.
+ */
+ node = (radix_tree_node *) DatumGetPointer(slot);
+ node_iter = &(iter->stack[level - 1]);
+ radix_tree_store_iter_node(iter, node_iter, node);
+
+ radix_tree_update_iter_stack(iter, level - 1);
+ }
+}
+
+void
+radix_tree_end_iterate(radix_tree_iter *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Update the part of the key being constructed during the iteration with the
+ * given chunk
+ */
+static inline void
+radix_tree_iter_update_key(radix_tree_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RADIX_TREE_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Iterate over the given radix tree node and return its next slot, setting
+ * *found_p to true. If there is no next slot, set *found_p to false.
+ */
+static Datum
+radix_tree_node_iterate_next(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ bool *found_p)
+{
+ radix_tree_node *node = node_iter->node;
+ Datum slot = (Datum) 0;
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n4->n.count)
+ goto not_found;
+
+ slot = n4->slots[node_iter->current_idx];
+
+ /* Update the part of the key with the current chunk */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, n4->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n32->n.count)
+ goto not_found;
+
+ slot = n32->slots[node_iter->current_idx];
+
+ /* Update the part of the key with the current chunk */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, n32->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_128_is_chunk_used(n128, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = node_128_get_chunk_slot(n128, i);
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = n256->slots[i];
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(node))
+ radix_tree_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ }
+
+ *found_p = true;
+ return slot;
+
+not_found:
+ *found_p = false;
+ return (Datum) 0;
+}
+
+/*
+ * Initialize and update the node iteration struct with the given radix tree node.
+ * This function also updates the part of the key with the chunk of the given node.
+ */
+static void
+radix_tree_store_iter_node(radix_tree_iter *iter, radix_tree_iter_node_data *node_iter,
+ radix_tree_node *node)
+{
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ radix_tree_iter_update_key(iter, node->chunk, node->shift + RADIX_TREE_NODE_FANOUT);
+}
+
+/*
+ * Build the stack of radix tree nodes while descending to the leftmost leaf
+ * from the 'from' level.
+ */
+static void
+radix_tree_update_iter_stack(radix_tree_iter *iter, int from)
+{
+ radix_tree_node *node = iter->stack[from].node;
+ int level = from;
+
+ for (;;)
+ {
+ radix_tree_iter_node_data *node_iter = &(iter->stack[level--]);
+ bool found;
+
+ /* Set the current node */
+ radix_tree_store_iter_node(iter, node_iter, node);
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ node = (radix_tree_node *)
+ DatumGetPointer(radix_tree_node_iterate_next(iter, node_iter, &found));
+
+ /*
+ * Since we always fetch the first slot in the node, we must find a
+ * slot here.
+ */
+ Assert(found);
+ }
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+radix_tree_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the statistics of the amount of memory used by the radix tree.
+ */
+uint64
+radix_tree_memory_usage(radix_tree *tree)
+{
+ return tree->mem_used;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+radix_tree_verify_node(radix_tree_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+
+ /* Check if the chunks in the node are sorted */
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+
+ /* Check if the chunks in the node are sorted */
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RADIX_TREE_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(node_128_is_slot_used(n128, n128->slot_idxs[i] - 1));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RADIX_TREE_NODE_MAX_BITS; i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check that the number of used chunks matches */
+ Assert(n256->n.count == cnt);
+
+ break;
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RADIX_TREE_DEBUG
+void
+radix_tree_stats(radix_tree *tree)
+{
+ fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u(%lu), n32 = %u(%lu), n128 = %u(%lu), n256 = %u(%lu)",
+ tree->num_keys,
+ tree->root->shift / RADIX_TREE_NODE_FANOUT,
+ tree->cnt[0], tree->cnt[0] * sizeof(radix_tree_node_4),
+ tree->cnt[1], tree->cnt[1] * sizeof(radix_tree_node_32),
+ tree->cnt[2], tree->cnt[2] * sizeof(radix_tree_node_128),
+ tree->cnt[3], tree->cnt[3] * sizeof(radix_tree_node_256));
+ /* radix_tree_dump(tree); */
+}
+
+static void
+radix_tree_print_slot(StringInfo buf, uint8 chunk, Datum slot, int idx, bool is_leaf, int level)
+{
+ char space[128] = {0};
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ if (is_leaf)
+ appendStringInfo(buf, "%s[%d] \"0x%X\" val(0x%lX) LEAF\n",
+ space,
+ idx,
+ chunk,
+ DatumGetInt64(slot));
+ else
+ appendStringInfo(buf, "%s[%d] \"0x%X\" -> ",
+ space,
+ idx,
+ chunk);
+}
+
+static void
+radix_tree_dump_node(radix_tree_node *node, int level, StringInfo buf, bool recurse)
+{
+ bool is_leaf = IS_LEAF_NODE(node);
+
+ appendStringInfo(buf, "[\"%s\" type %d, cnt %u, shift %u, chunk \"0x%X\"] chunks:\n",
+ IS_LEAF_NODE(node) ? "LEAF" : "INNR",
+ (node->kind == RADIX_TREE_NODE_KIND_4) ? 4 :
+ (node->kind == RADIX_TREE_NODE_KIND_32) ? 32 :
+ (node->kind == RADIX_TREE_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ switch (node->kind)
+ {
+ case RADIX_TREE_NODE_KIND_4:
+ {
+ radix_tree_node_4 *n4 = (radix_tree_node_4 *) node;
+
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ radix_tree_print_slot(buf, n4->chunks[i], n4->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n4->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_32:
+ {
+ radix_tree_node_32 *n32 = (radix_tree_node_32 *) node;
+
+ for (int i = 0; i < n32->n.count; i++)
+ {
+ radix_tree_print_slot(buf, n32->chunks[i], n32->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n32->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_128:
+ {
+ radix_tree_node_128 *n128 = (radix_tree_node_128 *) node;
+
+ for (int j = 0; j < 256; j++)
+ {
+ if (!node_128_is_chunk_used(n128, j))
+ continue;
+
+ appendStringInfo(buf, "slot_idxs[%d]=%d, ", j, n128->slot_idxs[j]);
+ }
+ appendStringInfo(buf, "\nisset-bitmap:");
+ for (int j = 0; j < 16; j++)
+ {
+ appendStringInfo(buf, "%X ", (uint8) n128->isset[j]);
+ }
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ radix_tree_print_slot(buf, i, node_128_get_chunk_slot(n128, i),
+ i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) node_128_get_chunk_slot(n128, i),
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RADIX_TREE_NODE_KIND_256:
+ {
+ radix_tree_node_256 *n256 = (radix_tree_node_256 *) node;
+
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_256_is_chunk_used(n256, i))
+ continue;
+
+ radix_tree_print_slot(buf, i, n256->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ radix_tree_dump_node((radix_tree_node *) n256->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+radix_tree_dump_search(radix_tree *tree, uint64 key)
+{
+ StringInfoData buf;
+ radix_tree_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ radix_tree_node *child;
+
+ radix_tree_dump_node(node, level, &buf, false);
+
+ if (IS_LEAF_NODE(node))
+ {
+ Datum *dummy;
+
+ /* We reached at a leaf node, find the corresponding slot */
+ radix_tree_node_search(node, &dummy, key, RADIX_TREE_FIND);
+
+ break;
+ }
+
+ if (!radix_tree_node_search_child(node, &child, key))
+ break;
+
+ node = child;
+ shift -= RADIX_TREE_NODE_FANOUT;
+ level++;
+ }
+
+ elog(NOTICE, "\n%s", buf.data);
+}
+
+void
+radix_tree_dump(radix_tree *tree)
+{
+ StringInfoData buf;
+
+ initStringInfo(&buf);
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu", tree->max_val);
+ radix_tree_dump_node(tree->root, 0, &buf, true);
+ elog(NOTICE, "\n%s", buf.data);
+ elog(NOTICE, "-----------------------------------------------------------");
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..7e864d124b
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+/* #define RADIX_TREE_DEBUG 1 */
+
+typedef struct radix_tree radix_tree;
+typedef struct radix_tree_iter radix_tree_iter;
+
+extern radix_tree *radix_tree_create(MemoryContext ctx);
+extern bool radix_tree_search(radix_tree *tree, uint64 key, Datum *val_p);
+extern void radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p);
+extern bool radix_tree_delete(radix_tree *tree, uint64 key);
+extern void radix_tree_free(radix_tree *tree);
+extern uint64 radix_tree_memory_usage(radix_tree *tree);
+extern uint64 radix_tree_num_entries(radix_tree *tree);
+
+extern radix_tree_iter *radix_tree_begin_iterate(radix_tree *tree);
+extern bool radix_tree_iterate_next(radix_tree_iter *iter, uint64 *key_p, Datum *value_p);
+extern void radix_tree_end_iterate(radix_tree_iter *iter);
+
+
+#ifdef RADIX_TREE_DEBUG
+extern void radix_tree_dump(radix_tree *tree);
+extern void radix_tree_dump_search(radix_tree *tree, uint64 key);
+extern void radix_tree_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9090226daa..51b2514faf 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -24,6 +24,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'radix_tree_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..6d5b06a800
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,502 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool radix_tree_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int radix_tree_node_max_entries[] = {
+ 4, /* RADIX_TREE_NODE_KIND_4 */
+ 16, /* RADIX_TREE_NODE_KIND_16 */
+ 128, /* RADIX_TREE_NODE_KIND_128 */
+ 256 /* RADIX_TREE_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ Datum dummy;
+
+ radixtree = radix_tree_create(CurrentMemoryContext);
+
+ if (radix_tree_search(radixtree, 0, &dummy))
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ if (radix_tree_search(radixtree, 1, &dummy))
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ if (radix_tree_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "radix_tree_search on empty tree returned true");
+
+ if (radix_tree_num_entries(radixtree) != 0)
+ elog(ERROR, "radix_tree_num_entries on empty tree return non-zero");
+
+ radix_tree_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ Datum val;
+
+ if (!radix_tree_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (DatumGetUInt64(val) != key)
+ elog(ERROR, "radix_tree_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, DatumGetUInt64(val), key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ radix_tree_insert(radixtree, key, Int64GetDatum(key), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(radix_tree_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values are
+ * stored properly.
+ */
+ if (i == (radix_tree_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : radix_tree_node_max_entries[j - 1],
+ radix_tree_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = radix_tree_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "radix_tree_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = radix_tree_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = radix_tree_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "radix_tree_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = radix_tree_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search
+ * entries again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ radix_tree_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+ radix_tree *radixtree;
+ radix_tree_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (radix_tree_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = radix_tree_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ radix_tree_insert(radixtree, x, Int64GetDatum(x), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (radix_tree_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by radix_tree_memory_usage(), as well as the
+ * stats from the memory context. They should be in the same ballpark,
+ * but it's hard to automate testing that, so if you're making changes to
+ * the implementation, just observe that manually.
+ */
+ if (radix_tree_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by radix_tree_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = radix_tree_memory_usage(radixtree);
+ fprintf(stderr, "radix_tree_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that radix_tree_num_entries works */
+ n = radix_tree_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "radix_tree_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with radix_tree_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ Datum v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to radix_tree_search() ? */
+ found = radix_tree_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (DatumGetUInt64(v) != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ DatumGetUInt64(v), x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (radix_tree_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = radix_tree_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ Datum val;
+
+ if (!radix_tree_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ if (DatumGetUInt64(val) != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (radix_tree_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with radix_tree_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = radix_tree_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ Datum v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to radix_tree_search() ? */
+ found = radix_tree_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!radix_tree_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (radix_tree_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (radix_tree_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (radix_tree_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = radix_tree_num_entries(radixtree);
+
+ /* Check that radix_tree_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "radix_tree_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
On Thu, Jun 16, 2022 at 11:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached an updated version patch that changes the configure
script. I'm still studying how to support AVX2 on msvc build. Also,
added more regression tests.
Thanks for the update, I will take a closer look at the patch in the
near future, possibly next week. For now, though, I'd like to question
why we even need to use 32-byte registers in the first place. For one,
the paper referenced has 16-pointer nodes, but none for 32 (next level
is 48 and uses a different method to find the index of the next
pointer). Andres' prototype has 32-pointer nodes, but in a quick read
of his patch a couple weeks ago I don't recall a reason mentioned for
it. Even if 32-pointer nodes are better from a memory perspective, I
imagine it should be possible to use two SSE2 registers to find the
index. It'd be locally slightly more complex, but not much. It might
not even cost much more in cycles since AVX2 would require indirecting
through a function pointer. It's much more convenient if we don't need
a runtime check. There are also thermal and power disadvantages when
using AVX2 in some workloads. I'm not sure that's the case here, but
if it is, we'd better be getting something in return.
One more thing in general: In an earlier version, I noticed that
Andres used the slab allocator and documented why. The last version of
your patch that I saw had the same allocator, but not the "why".
Especially in early stages of review, we want to document design
decisions so it's more clear for the reader.
--
John Naylor
EDB: http://www.enterprisedb.com
On 2022-06-16 Th 00:56, Masahiko Sawada wrote:
I've attached an updated version patch that changes the configure
script. I'm still studying how to support AVX2 on msvc build. Also,
added more regression tests.
I think you would need to add '/arch:AVX2' to the compiler flags in
MSBuildProject.pm.
See
<https://docs.microsoft.com/en-us/cpp/build/reference/arch-x64?view=msvc-170>
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hi,
On Thu, Jun 16, 2022 at 4:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Jun 16, 2022 at 11:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached an updated version patch that changes the configure
script. I'm still studying how to support AVX2 on msvc build. Also,
added more regression tests.
Thanks for the update, I will take a closer look at the patch in the
near future, possibly next week.
Thanks!
For now, though, I'd like to question
why we even need to use 32-byte registers in the first place. For one,
the paper referenced has 16-pointer nodes, but none for 32 (next level
is 48 and uses a different method to find the index of the next
pointer). Andres' prototype has 32-pointer nodes, but in a quick read
of his patch a couple weeks ago I don't recall a reason mentioned for
it.
I might be wrong, but since the AVX2 instruction set was introduced with the
Haswell microarchitecture in 2013 and the referenced paper was published in
the same year, ART didn't use the AVX2 instruction set.
32-pointer nodes are better from a memory perspective as you
mentioned. Andres' prototype supports both 16-pointer nodes and
32-pointer nodes (out of 6 node types). This would provide better
memory usage but on the other hand, it would also bring overhead of
switching the node type. Anyway, which node sizes to support is an important
design decision. It should be made based on experiment results and
documented.
Even if 32-pointer nodes are better from a memory perspective, I
imagine it should be possible to use two SSE2 registers to find the
index. It'd be locally slightly more complex, but not much. It might
not even cost much more in cycles since AVX2 would require indirecting
through a function pointer. It's much more convenient if we don't need
a runtime check.
Right.
There are also thermal and power disadvantages when
using AVX2 in some workloads. I'm not sure that's the case here, but
if it is, we'd better be getting something in return.
Good point.
One more thing in general: In an earlier version, I noticed that
Andres used the slab allocator and documented why. The last version of
your patch that I saw had the same allocator, but not the "why".
Especially in early stages of review, we want to document design
decisions so it's more clear for the reader.
Indeed. I'll add comments in the next version patch.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, Jun 20, 2022 at 7:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
[v3 patch]
Hi Masahiko,
Since there are new files, and they are pretty large, I've attached
most specific review comments and questions as a diff rather than in
the email body. This is not a full review, which will take more time
-- this is a first pass mostly to aid my understanding, and discuss
some of the design and performance implications.
I tend to think it's a good idea to avoid most cosmetic review until
it's close to commit, but I did mention a couple things that might
enhance readability during review.
As I mentioned to you off-list, I have some thoughts on the nodes using SIMD:
On Thu, Jun 16, 2022 at 4:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:For now, though, I'd like to question
why we even need to use 32-byte registers in the first place. For one,
the paper referenced has 16-pointer nodes, but none for 32 (next level
is 48 and uses a different method to find the index of the next
pointer). Andres' prototype has 32-pointer nodes, but in a quick read
of his patch a couple weeks ago I don't recall a reason mentioned for
it.
I might be wrong but since AVX2 instruction set is introduced in
Haswell microarchitecture in 2013 and the referenced paper is
published in the same year, the art didn't use AVX2 instruction set.
Sure, but with a bit of work the same technique could be done on that
node size with two 16-byte registers.
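For illustration only (not from the patch), a two-register SSE2 probe of a
32-entry chunk array could look roughly like this, assuming the node stores
its chunks in a plain uint8 array alongside a count of valid entries:

#include "postgres.h"
#include "port/pg_bitutils.h"
#include <emmintrin.h>          /* SSE2 intrinsics */

/* Sketch: return the index of 'chunk' among the first 'count' entries, or -1. */
static inline int
chunk_array_32_search_eq(const uint8 *chunks, int count, uint8 chunk)
{
    __m128i key = _mm_set1_epi8(chunk);
    __m128i cmp_lo = _mm_cmpeq_epi8(key, _mm_loadu_si128((const __m128i *) chunks));
    __m128i cmp_hi = _mm_cmpeq_epi8(key, _mm_loadu_si128((const __m128i *) (chunks + 16)));
    uint32  mask;

    /* Combine the two 16-bit match masks into one 32-bit mask. */
    mask = (uint32) _mm_movemask_epi8(cmp_lo) |
        ((uint32) _mm_movemask_epi8(cmp_hi) << 16);

    /* Ignore any matches beyond the valid entries. */
    if (count < 32)
        mask &= ((uint32) 1 << count) - 1;

    return mask ? pg_rightmost_one_pos32(mask) : -1;
}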
32-pointer nodes are better from a memory perspective as you
mentioned. Andres' prototype supports both 16-pointer nodes and
32-pointer nodes (out of 6 node types). This would provide better
memory usage but on the other hand, it would also bring overhead of
switching the node type.
Right, using more node types provides smaller increments of node size.
Just changing node type can be better or worse, depending on the
input.
Anyway, it's an important design decision to
support which size of node to support. It should be done based on
experiment results and documented.
Agreed. I would add that in the first step, we want something
straightforward to read and easy to integrate into our codebase. I
suspect other optimizations would be worth a lot more than using AVX2:
- collapsing inner nodes
- taking care when constructing the key (more on this when we
integrate with VACUUM)
...and a couple Andres mentioned:
- memory management: in
/messages/by-id/20210717194333.mr5io3zup3kxahfm@alap3.anarazel.de
- node dispatch:
/messages/by-id/20210728184139.qhvx6nbwdcvo63m6@alap3.anarazel.de
Therefore, I would suggest that we use SSE2 only, because:
- portability is very easy
- to avoid a performance hit from indirecting through a function pointer
When the PG16 cycle opens, I will work separately on ensuring the
portability of using SSE2, so you can focus on other aspects. I think
it would be a good idea to have both node16 and node32 for testing.
During benchmarking we can delete one or the other and play with the
other thresholds a bit.
Ideally, node16 and node32 would have the same code with a different
loop count (1 or 2). More generally, there is too much duplication of
code (noted by Andres in his PoC), and there are many variable names
with the node size embedded. This is a bit tricky to make more
general, so we don't need to try it yet, but ideally we would have
something similar to:
switch (node->kind) // todo: inspect tagged pointer
{
case RADIX_TREE_NODE_KIND_4:
idx = node_search_eq(node, chunk, 4);
do_action(node, idx, 4, ...);
break;
case RADIX_TREE_NODE_KIND_32:
idx = node_search_eq(node, chunk, 32);
do_action(node, idx, 32, ...);
...
}
static pg_alwaysinline void
node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
{
if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
// do simple loop with (node_simple *) node;
else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
// do vectorized loop where available with (node_vec *) node;
...
}
...and let the compiler do loop unrolling and branch removal. Not sure
how difficult this is to do, but something to think about.
Another thought: for non-x86 platforms, the SIMD nodes degenerate to
"simple loop", and looping over up to 32 elements is not great
(although possibly okay). We could do binary search, but that has bad
branch prediction.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v3-radix-review-diff-20220627.txt (text/plain)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bf87f932fd..2bb04eba86 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -16,6 +16,11 @@
*
* The key is a 64-bit unsigned integer and the value is a Datum. Both internal
* nodes and leaf nodes have the identical structure. For internal tree nodes,
+It might be worth mentioning:
+- the paper refers to this technique as "Multi-value leaves"
+- we chose it (I assume) for simplicity and to avoid an additional pointer traversal
+- it is the reason this code currently does not support variable-length keys.
+
* shift > 0, store the pointer to its child node as the value. The leaf nodes,
* shift == 0, also have the Datum value that is specified by the user.
*
@@ -24,6 +29,7 @@
* Interface
* ---------
*
+*_search belongs here too.
* radix_tree_create - Create a new, empty radix tree
* radix_tree_free - Free the radix tree
* radix_tree_insert - Insert a key-value pair
@@ -58,12 +64,18 @@
#include <immintrin.h> /* AVX2 intrinsics */
#endif
+// The name prefixes are a bit long, to shorten, maybe s/radix_tree_/rt_/ ?
+// ...and same for capitalized macros -> RT_
+
/* The number of bits encoded in one tree level */
+// terminology: this is not fanout, it's "span" -- ART has variable fanout (the different node types)
+// maybe BITS_PER_BYTE since the entire code assumes that chunks are byte-addressable
#define RADIX_TREE_NODE_FANOUT 8
/* The number of maximum slots in the node, used in node-256 */
#define RADIX_TREE_NODE_MAX_SLOTS (1 << RADIX_TREE_NODE_FANOUT)
+// maybe call them "nodes indexed by array lookups" -- the actual size is unimportant and could change
/*
* Return the number of bits required to represent nslots slots, used
* in node-128 and node-256.
@@ -84,7 +96,9 @@
((uint8) (((key) >> (shift)) & RADIX_TREE_CHUNK_MASK))
/* Mapping from the value to the bit in is-set bitmap in the node-128 and node-256 */
+// these macros assume we're addressing bytes, so maybe BITS_PER_BYTE instead of span (here referred to as fanout)?
#define NODE_BITMAP_BYTE(v) ((v) / RADIX_TREE_NODE_FANOUT)
+// Should this be UINT64CONST?
#define NODE_BITMAP_BIT(v) (UINT64_C(1) << ((v) % RADIX_TREE_NODE_FANOUT))
/* Enum used radix_tree_node_search() */
@@ -132,6 +146,7 @@ typedef struct radix_tree_node
} radix_tree_node;
/* Macros for radix tree nodes */
+// not sure why are we doing casts here?
#define IS_LEAF_NODE(n) (((radix_tree_node *) (n))->shift == 0)
#define IS_EMPTY_NODE(n) (((radix_tree_node *) (n))->count == 0)
#define NODE_HAS_FREE_SLOT(n) \
@@ -161,11 +176,14 @@ typedef struct radix_tree_node_32
Datum slots[32];
} radix_tree_node_32;
+// unnecessary symbol
#define RADIX_TREE_NODE_128_BITS RADIX_TREE_NODE_NSLOTS_BITS(128)
typedef struct radix_tree_node_128
{
radix_tree_node n;
+// maybe use 0xFF for INVALID_IDX ? then we can use 0-indexing
+// and if we do that, do we need isset? on creation, we can just memset slot_idx to INVALID_IDX
/*
* The index of slots for each fanout. 0 means unused whereas slots is
* 0-indexed. So we can get the slot of the chunk C by slots[C] - 1.
@@ -178,6 +196,7 @@ typedef struct radix_tree_node_128
Datum slots[128];
} radix_tree_node_128;
+// unnecessary symbol
#define RADIX_TREE_NODE_MAX_BITS RADIX_TREE_NODE_NSLOTS_BITS(RADIX_TREE_NODE_MAX_SLOTS)
typedef struct radix_tree_node_256
{
@@ -205,6 +224,7 @@ static radix_tree_node_info_elem radix_tree_node_info[] =
{"radix tree node 256", 256, sizeof(radix_tree_node_256)},
};
+// this comment is about a data structure, but talks about code somewhere else
/*
* As we descend a radix tree, we push the node to the stack. The stack is used
* at deletion.
@@ -262,6 +282,7 @@ struct radix_tree
static radix_tree_node *radix_tree_node_grow(radix_tree *tree, radix_tree_node *parent,
radix_tree_node *node, uint64 key);
+// maybe _node_find_child or _get_child because "search child" implies to me that we're searching within the child.
static bool radix_tree_node_search_child(radix_tree_node *node, radix_tree_node **child_p,
uint64 key);
static bool radix_tree_node_search(radix_tree_node *node, Datum **slot_p, uint64 key,
@@ -289,14 +310,19 @@ static void radix_tree_verify_node(radix_tree_node *node);
static inline int
node_32_search_eq(radix_tree_node_32 *node, uint8 chunk)
{
+// If we use SSE intrinsics on Windows, this code might be still be slow (see below),
+// so also guard with HAVE__BUILTIN_CTZ
#ifdef __AVX2__
__m256i _key = _mm256_set1_epi8(chunk);
__m256i _data = _mm256_loadu_si256((__m256i_u *) node->chunks);
__m256i _cmp = _mm256_cmpeq_epi8(_key, _data);
uint32 bitfield = _mm256_movemask_epi8(_cmp);
+// bitfield is uint32, so we don't need UINT64_C
bitfield &= ((UINT64_C(1) << node->n.count) - 1);
+// To make this portable, should be pg_rightmost_one_pos32().
+// Future TODO: This is slow on Windows, until will need to add the correct interfaces to pg_bitutils.h.
return (bitfield) ? __builtin_ctz(bitfield) : -1;
#else
@@ -313,6 +339,7 @@ node_32_search_eq(radix_tree_node_32 *node, uint8 chunk)
#endif /* __AVX2__ */
}
+// copy-paste error: search_chunk_array_16_eq
/*
* This is a bit more complicated than search_chunk_array_16_eq(), because
* until recently no unsigned uint8 comparison instruction existed on x86. So
@@ -346,6 +373,7 @@ node_32_search_le(radix_tree_node_32 *node, uint8 chunk)
#endif /* __AVX2__ */
}
+// see 0xFF idea above
/* Does the given chunk in the node has the value? */
static inline bool
node_128_is_chunk_used(radix_tree_node_128 *node, uint8 chunk)
@@ -367,6 +395,8 @@ node_128_set(radix_tree_node_128 *node, uint8 chunk, Datum val)
int slotpos = 0;
/* Search an unused slot */
+ // this could be slow - maybe iterate over the bytes and if the byte < 0xFF then check each bit
+ //
while (node_128_is_slot_used(node, slotpos))
slotpos++;
@@ -516,6 +546,7 @@ radix_tree_extend(radix_tree *tree, uint64 key)
max_shift = key_get_shift(key);
+ // why do we need the "max height" and not just one more?
/* Grow tree from 'shift' to 'max_shift' */
while (shift <= max_shift)
{
@@ -752,6 +783,7 @@ radix_tree_node_insert_val(radix_tree *tree, radix_tree_node *parent,
memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
sizeof(uint8) * (n4->n.count - idx));
memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
+ // sizeof(Datum) ?
sizeof(radix_tree_node *) * (n4->n.count - idx));
}
Another thought: for non-x86 platforms, the SIMD nodes degenerate to
"simple loop", and looping over up to 32 elements is not great
(although possibly okay). We could do binary search, but that has bad
branch prediction.
I am not sure that for relevant non-x86 platforms SIMD / vector
instructions would not be used (though it would be a good idea to
verify)
Do you know any modern platforms that do not have SIMD ?
I would definitely test before assuming binary search is better.
Often other approaches like counting search over such small vectors is
much better when the vector fits in cache (or even a cache line) and
you always visit all items as this will completely avoid branch
predictions and allows compiler to vectorize and / or unroll the loop
as needed.
Cheers
Hannu
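To make the counting-search idea above concrete, here is an illustrative
sketch (not code from the patch), assuming the node keeps its chunks sorted
the way the node-4/node-32 insertion code does:

#include "postgres.h"

/*
 * Sketch: branch-free search of a small sorted chunk array.  Every element
 * is visited, so there is no data-dependent branch to mispredict and the
 * loop is easy for the compiler to unroll or vectorize.
 */
static inline int
chunk_array_count_search(const uint8 *chunks, int count, uint8 chunk)
{
    int index = 0;

    for (int i = 0; i < count; i++)
        index += (chunks[i] < chunk);   /* an add, not a branch */

    if (index < count && chunks[index] == chunk)
        return index;
    return -1;
}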
Hi,
On 2022-06-27 18:12:13 +0700, John Naylor wrote:
Another thought: for non-x86 platforms, the SIMD nodes degenerate to
"simple loop", and looping over up to 32 elements is not great
(although possibly okay). We could do binary search, but that has bad
branch prediction.
I'd be quite quite surprised if binary search were cheaper. Particularly on
less fancy platforms.
- Andres
On Mon, Jun 27, 2022 at 10:23 PM Hannu Krosing <hannuk@google.com> wrote:
Another thought: for non-x86 platforms, the SIMD nodes degenerate to
"simple loop", and looping over up to 32 elements is not great
(although possibly okay). We could do binary search, but that has bad
branch prediction.
I am not sure that for relevant non-x86 platforms SIMD / vector
instructions would not be used (though it would be a good idea to
verify)
By that logic, we can also dispense with intrinsics on x86 because the
compiler will autovectorize there too (if I understand your claim
correctly). I'm not quite convinced of that in this case.
I would definitely test before assuming binary search is better.
I wasn't very clear in my language, but I did reject binary search as
having bad branch prediction.
--
John Naylor
EDB: http://www.enterprisedb.com
Hi,
On 2022-06-28 11:17:42 +0700, John Naylor wrote:
On Mon, Jun 27, 2022 at 10:23 PM Hannu Krosing <hannuk@google.com> wrote:
Another thought: for non-x86 platforms, the SIMD nodes degenerate to
"simple loop", and looping over up to 32 elements is not great
(although possibly okay). We could do binary search, but that has bad
branch prediction.I am not sure that for relevant non-x86 platforms SIMD / vector
instructions would not be used (though it would be a good idea to
verify)By that logic, we can also dispense with intrinsics on x86 because the
compiler will autovectorize there too (if I understand your claim
correctly). I'm not quite convinced of that in this case.
Last time I checked (maybe a year ago?) none of the popular compilers could
autovectorize that code pattern.
Greetings,
Andres Freund
Hi,
On Mon, Jun 27, 2022 at 8:12 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Jun 20, 2022 at 7:57 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
[v3 patch]
Hi Masahiko,
Since there are new files, and they are pretty large, I've attached
most specific review comments and questions as a diff rather than in
the email body. This is not a full review, which will take more time
-- this is a first pass mostly to aid my understanding, and discuss
some of the design and performance implications.
I tend to think it's a good idea to avoid most cosmetic review until
it's close to commit, but I did mention a couple things that might
enhance readability during review.
Thank you for reviewing the patch!
As I mentioned to you off-list, I have some thoughts on the nodes using SIMD:
On Thu, Jun 16, 2022 at 4:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:For now, though, I'd like to question
why we even need to use 32-byte registers in the first place. For one,
the paper referenced has 16-pointer nodes, but none for 32 (next level
is 48 and uses a different method to find the index of the next
pointer). Andres' prototype has 32-pointer nodes, but in a quick read
of his patch a couple weeks ago I don't recall a reason mentioned for
it.
I might be wrong but since AVX2 instruction set is introduced in
Haswell microarchitecture in 2013 and the referenced paper is
published in the same year, the art didn't use AVX2 instruction set.
Sure, but with a bit of work the same technique could be done on that
node size with two 16-byte registers.
32-pointer nodes are better from a memory perspective as you
mentioned. Andres' prototype supports both 16-pointer nodes and
32-pointer nodes (out of 6 node types). This would provide better
memory usage but on the other hand, it would also bring overhead of
switching the node type.
Right, using more node types provides smaller increments of node size.
Just changing node type can be better or worse, depending on the
input.
Anyway, it's an important design decision to
support which size of node to support. It should be done based on
experiment results and documented.
Agreed. I would add that in the first step, we want something
straightforward to read and easy to integrate into our codebase.
Agreed.
I
suspect other optimizations would be worth a lot more than using AVX2:
- collapsing inner nodes
- taking care when constructing the key (more on this when we
integrate with VACUUM)
...and a couple Andres mentioned:
- memory management: in
/messages/by-id/20210717194333.mr5io3zup3kxahfm@alap3.anarazel.de
- node dispatch:
/messages/by-id/20210728184139.qhvx6nbwdcvo63m6@alap3.anarazel.de
Therefore, I would suggest that we use SSE2 only, because:
- portability is very easy
- to avoid a performance hit from indirecting through a function pointer
Okay, I'll try these optimizations and see if the performance becomes better.
When the PG16 cycle opens, I will work separately on ensuring the
portability of using SSE2, so you can focus on other aspects.
Thanks!
I think it would be a good idea to have both node16 and node32 for testing.
During benchmarking we can delete one or the other and play with the
other thresholds a bit.
I've done benchmark tests while changing the node types. The code base
is v3 patch that doesn't have the optimization you mentioned below
(memory management and node dispatch) but I added the code to use SSE2
for node-16 and node-32. The 'name' in the below result indicates the
kind of instruction set (AVX2 or SSE2) and the node type used. For
instance, sse2_4_32_48_256 means the radix tree has four kinds of node
types for each which have 4, 32, 48, and 256 pointers, respectively,
and use SSE2 instruction set.
* Case1 - Dense (simulating the case where there are 1000 consecutive
pages each of which has 100 dead tuples, at 100 page intervals.)
select prepare(
1000000, -- max block
100, -- # of dead tuples per page
1, -- dead tuples interval within a page
1000, -- # of consecutive pages having dead tuples
1100 -- page interval
);
name                   size     attach      lookup
avx2_4_32_128_256      1154 MB  6742.53 ms  47765.63 ms
avx2_4_32_48_256       1839 MB  4239.35 ms  40528.39 ms
sse2_4_16_128_256      1154 MB  6994.43 ms  40383.85 ms
sse2_4_16_32_128_256   1154 MB  7239.35 ms  43542.39 ms
sse2_4_16_48_256       1839 MB  4404.63 ms  36048.96 ms
sse2_4_32_128_256      1154 MB  6688.50 ms  44902.64 ms
* Case2 - Sparse (simulating a case where there are pages that have 2
dead tuples every 1000 pages.)
select prepare(
10000000, -- max block
2, -- # of dead tuples per page
50, -- dead tuples interval within a page
1, -- # of consecutive pages having dead tuples
1000 -- page interval
);
name                   size     attach   lookup
avx2_4_32_128_256      1535 kB  1.85 ms  17427.42 ms
avx2_4_32_48_256       1472 kB  2.01 ms  22176.75 ms
sse2_4_16_128_256      1582 kB  2.16 ms  15391.12 ms
sse2_4_16_32_128_256   1535 kB  2.14 ms  18757.86 ms
sse2_4_16_48_256       1489 kB  1.91 ms  19210.39 ms
sse2_4_32_128_256      1535 kB  2.05 ms  17777.55 ms
The statistics of the number of each node types are:
* avx2_4_32_128_256 (dense and sparse)
* nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1
* avx2_4_32_48_256
* nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n48 = 227, n256 = 916433
* nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n48 = 159, n256 = 50
* sse2_4_16_128_256
* nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n128 = 916914, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n128 = 256, n256 = 1
* sse2_4_16_32_128_256
* nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n32 = 285, n128 =
916629, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n32 = 48, n128 =
208, n256 = 1
* sse2_4_16_48_256
* nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
* nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50
* sse2_4_32_128_256
* nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1
Observations are:
In both test cases, there is not much difference between using AVX2
and SSE2. The more node types, the more time it takes for loading the
data (see sse2_4_16_32_128_256).
In dense case, since most nodes have around 100 children, the radix
tree that has node-128 had a good figure in terms of memory usage. On
the other hand, the radix tree that doesn't have node-128 has a better
number in terms of insertion performance. This is probably because we
need to iterate over 'isset' flags from the beginning of the array in
order to find an empty slot when inserting new data. We do the same
thing also for node-48 but it was better than node-128 as it's up to
48.
In terms of lookup performance, the results vary but I could not find
any common pattern that makes the performance better or worse. Getting
more statistics such as the number of each node type per tree level
might help me.
Ideally, node16 and node32 would have the same code with a different
loop count (1 or 2). More generally, there is too much duplication of
code (noted by Andres in his PoC), and there are many variable names
with the node size embedded. This is a bit tricky to make more
general, so we don't need to try it yet, but ideally we would have
something similar to:
switch (node->kind) // todo: inspect tagged pointer
{
case RADIX_TREE_NODE_KIND_4:
idx = node_search_eq(node, chunk, 4);
do_action(node, idx, 4, ...);
break;
case RADIX_TREE_NODE_KIND_32:
idx = node_search_eq(node, chunk, 32);
do_action(node, idx, 32, ...);
...
}
static pg_alwaysinline void
node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
{
if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
// do simple loop with (node_simple *) node;
else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
// do vectorized loop where available with (node_vec *) node;
...
}
...and let the compiler do loop unrolling and branch removal. Not sure
how difficult this is to do, but something to think about.
Agreed.
I'll update my patch based on your review comments and use SSE2.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I
suspect other optimizations would be worth a lot more than using AVX2:
- collapsing inner nodes
- taking care when constructing the key (more on this when we
integrate with VACUUM)
...and a couple Andres mentioned:
- memory management: in
/messages/by-id/20210717194333.mr5io3zup3kxahfm@alap3.anarazel.de
- node dispatch:
/messages/by-id/20210728184139.qhvx6nbwdcvo63m6@alap3.anarazel.de
Therefore, I would suggest that we use SSE2 only, because:
- portability is very easy
- to avoid a performance hit from indirecting through a function pointer
Okay, I'll try these optimizations and see if the performance becomes better.
FWIW, I think it's fine if we delay these until after committing a
good-enough version. The exception is key construction and I think
that deserves some attention now (more on this below).
I've done benchmark tests while changing the node types. The code base
is v3 patch that doesn't have the optimization you mentioned below
(memory management and node dispatch) but I added the code to use SSE2
for node-16 and node-32.
Great, this is helpful to visualize what's going on!
* sse2_4_16_48_256
* nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
* nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50
* sse2_4_32_128_256
* nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1
Observations are:
In both test cases, There is not much difference between using AVX2
and SSE2. The more mode types, the more time it takes for loading the
data (see sse2_4_16_32_128_256).
Good to know. And as Andres mentioned in his PoC, more node types
would be a barrier for pointer tagging, since 32-bit platforms only
have two spare bits in the pointer.
In dense case, since most nodes have around 100 children, the radix
tree that has node-128 had a good figure in terms of memory usage. On
Looking at the node stats, and then your benchmark code, I think key
construction is a major influence, maybe more than node type. The
key/value scheme tested now makes sense:
blockhi || blocklo || 9 bits of item offset
(with the leaf nodes containing a bit map of the lowest few bits of
this whole thing)
We want the lower fanout nodes at the top of the tree and higher
fanout ones at the bottom.
Note some consequences: If the table has enough columns such that much
fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
dense case the nodes above the leaves will have lower fanout (maybe
they will fit in a node32). Also, the bitmap values in the leaves will
be more empty. In other words, many tables in the wild *resemble* the
sparse case a bit, even if truly all tuples on the page are dead.
Note also that the dense case in the benchmark above has ~4500 times
more keys than the sparse case, and uses about ~1000 times more
memory. But the runtime is only 2-3 times longer. That's interesting
to me.
To optimize for the sparse case, it seems to me that the key/value would be
blockhi || 9 bits of item offset || blocklo
I believe that would make the leaf nodes more dense, with fewer inner
nodes, and could drastically speed up the sparse case, and maybe many
realistic dense cases. I'm curious to hear your thoughts.
the other hand, the radix tree that doesn't have node-128 has a better
number in terms of insertion performance. This is probably because we
need to iterate over 'isset' flags from the beginning of the array in
order to find an empty slot when inserting new data. We do the same
thing also for node-48 but it was better than node-128 as it's up to
48.
I mentioned in my diff, but for those following along, I think we can
improve that by iterating over the bytes and if it's 0xFF all 8 bits
are set already so keep looking...
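Roughly like this, for illustration (the 16-byte isset bitmap of a node-128
is an assumption about the layout, not code from the patch):

/* Sketch: find a free slot, skipping bitmap bytes that are already full. */
static inline int
node_128_find_free_slot(const uint8 *isset)
{
    for (int byte = 0; byte < 128 / 8; byte++)
    {
        if (isset[byte] == 0xFF)
            continue;           /* all 8 slots covered by this byte are taken */

        for (int bit = 0; bit < 8; bit++)
        {
            if ((isset[byte] & (1 << bit)) == 0)
                return byte * 8 + bit;
        }
    }

    return -1;                  /* node is full; caller must grow it */
}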
In terms of lookup performance, the results vary but I could not find
any common pattern that makes the performance better or worse. Getting
more statistics such as the number of each node type per tree level
might help me.
I think that's a sign that the choice of node types might not be
terribly important for these two cases. That's good if that's true in
general -- a future performance-critical use of this code might tweak
things for itself without upsetting vacuum.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Jun 28, 2022 at 10:10 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I
suspect other optimizations would be worth a lot more than using AVX2:
- collapsing inner nodes
- taking care when constructing the key (more on this when we
integrate with VACUUM)
...and a couple Andres mentioned:
- memory management: in
/messages/by-id/20210717194333.mr5io3zup3kxahfm@alap3.anarazel.de
- node dispatch:
/messages/by-id/20210728184139.qhvx6nbwdcvo63m6@alap3.anarazel.de
Therefore, I would suggest that we use SSE2 only, because:
- portability is very easy
- to avoid a performance hit from indirecting through a function pointer
Okay, I'll try these optimizations and see if the performance becomes better.
FWIW, I think it's fine if we delay these until after committing a
good-enough version. The exception is key construction and I think
that deserves some attention now (more on this below).
Agreed.
I've done benchmark tests while changing the node types. The code base
is v3 patch that doesn't have the optimization you mentioned below
(memory management and node dispatch) but I added the code to use SSE2
for node-16 and node-32.
Great, this is helpful to visualize what's going on!
* sse2_4_16_48_256
* nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
* nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50
* sse2_4_32_128_256
* nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1
Observations are:
In both test cases, There is not much difference between using AVX2
and SSE2. The more mode types, the more time it takes for loading the
data (see sse2_4_16_32_128_256).
Good to know. And as Andres mentioned in his PoC, more node types
would be a barrier for pointer tagging, since 32-bit platforms only
have two spare bits in the pointer.
In dense case, since most nodes have around 100 children, the radix
tree that has node-128 had a good figure in terms of memory usage. On
Looking at the node stats, and then your benchmark code, I think key
construction is a major influence, maybe more than node type. The
key/value scheme tested now makes sense:
blockhi || blocklo || 9 bits of item offset
(with the leaf nodes containing a bit map of the lowest few bits of
this whole thing)
We want the lower fanout nodes at the top of the tree and higher
fanout ones at the bottom.
So more inner nodes can fit in CPU cache, right?
Note some consequences: If the table has enough columns such that much
fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
dense case the nodes above the leaves will have lower fanout (maybe
they will fit in a node32). Also, the bitmap values in the leaves will
be more empty. In other words, many tables in the wild *resemble* the
sparse case a bit, even if truly all tuples on the page are dead.
Note also that the dense case in the benchmark above has ~4500 times
more keys than the sparse case, and uses about ~1000 times more
memory. But the runtime is only 2-3 times longer. That's interesting
to me.
To optimize for the sparse case, it seems to me that the key/value would be
blockhi || 9 bits of item offset || blocklo
I believe that would make the leaf nodes more dense, with fewer inner
nodes, and could drastically speed up the sparse case, and maybe many
realistic dense cases.
Does it have an effect on the number of inner nodes?
I'm curious to hear your thoughts.
Thank you for your analysis. It's worth trying. We use 9 bits for item
offset but most pages don't use all bits in practice. So probably it
might be better to move the most significant bit of item offset to the
left of blockhi. Or more simply:
9 bits of item offset || blockhi || blocklo
the other hand, the radix tree that doesn't have node-128 has a better
number in terms of insertion performance. This is probably because we
need to iterate over 'isset' flags from the beginning of the array in
order to find an empty slot when inserting new data. We do the same
thing also for node-48 but it was better than node-128 as it's up to
48.
I mentioned in my diff, but for those following along, I think we can
improve that by iterating over the bytes and if it's 0xFF all 8 bits
are set already so keep looking...
Right. Using 0xFF also makes the code readable so I'll change that.
In terms of lookup performance, the results vary but I could not find
any common pattern that makes the performance better or worse. Getting
more statistics such as the number of each node type per tree level
might help me.
I think that's a sign that the choice of node types might not be
terribly important for these two cases. That's good if that's true in
general -- a future performance-critical use of this code might tweak
things for itself without upsetting vacuum.
Agreed.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
I just noticed that I had a reply forgotten in drafts...
On 2022-05-10 10:51:46 +0900, Masahiko Sawada wrote:
To move this project forward, I've implemented radix tree
implementation from scratch while studying Andres's implementation. It
supports insertion, search, and iteration but not deletion yet. In my
implementation, I use Datum as the value so internal and leaf nodes
have the same data structure, simplifying the implementation. The
iteration on the radix tree returns keys with the value in ascending
order of the key. The patch has regression tests for radix tree but is
still in PoC state: left many debugging codes, not supported SSE2 SIMD
instructions, added -mavx2 flag is hard-coded.
Very cool - thanks for picking this up.
Greetings,
Andres Freund
Hi,
On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote:
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..bf87f932fd
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,1763 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instruction, enabling us to use 256-bit
+ * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper.
+ * Also, there is no support for path compression and lazy path expansion. The
+ * radix tree supports fixed length of the key so we don't expect the tree level
+ * wouldn't be high.
I think we're going to need path compression at some point, fwiw. I'd bet on
it being beneficial even for the tid case.
+ * The key is a 64-bit unsigned integer and the value is a Datum.
I don't think it's a good idea to define the value type to be a datum.
+/*
+ * As we descend a radix tree, we push the node to the stack. The stack is used
+ * at deletion.
+ */
+typedef struct radix_tree_stack_data
+{
+    radix_tree_node *node;
+    struct radix_tree_stack_data *parent;
+} radix_tree_stack_data;
+typedef radix_tree_stack_data *radix_tree_stack;
I think it's a very bad idea for traversal to need allocations. I really want
to eventually use this for shared structures (eventually with lock-free
searches at least), and needing to do allocations while traversing the tree is
a no-go for that.
Particularly given that the tree currently has a fixed depth, can't you just
allocate this on the stack once?
+/*
+ * Allocate a new node with the given node kind.
+ */
+static radix_tree_node *
+radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
+{
+    radix_tree_node *newnode;
+
+    newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
+                                                         radix_tree_node_info[kind].size);
+    newnode->kind = kind;
+
+    /* update the statistics */
+    tree->mem_used += GetMemoryChunkSpace(newnode);
+    tree->cnt[kind]++;
+
+    return newnode;
+}
Why are you tracking the memory usage at this level of detail? It's *much*
cheaper to track memory usage via the memory contexts? Since they're dedicated
for the radix tree, that ought to be sufficient?
+    else if (idx != n4->n.count)
+    {
+        /*
+         * the key needs to be inserted in the middle of the
+         * array, make space for the new key.
+         */
+        memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
+                sizeof(uint8) * (n4->n.count - idx));
+        memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
+                sizeof(radix_tree_node *) * (n4->n.count - idx));
+    }
Maybe we could add a static inline helper for these memmoves? Both because
it's repetitive (for different node types) and because the last time I looked
gcc was generating quite bad code for this. And having to put workarounds into
multiple places is obviously worse than having to do it in one place.
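A minimal sketch of such a helper (names are illustrative, not from the
patch; any compiler-specific workaround would then live in one place):

/* Sketch: make room at 'idx' in parallel chunk and slot arrays. */
static inline void
chunk_array_make_room(uint8 *chunks, Datum *slots, int count, int idx)
{
    memmove(&chunks[idx + 1], &chunks[idx], sizeof(uint8) * (count - idx));
    memmove(&slots[idx + 1], &slots[idx], sizeof(Datum) * (count - idx));
}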
+/*
+ * Insert the key with the val.
+ *
+ * found_p is set to true if the key already present, otherwise false, if
+ * it's not NULL.
+ *
+ * XXX: do we need to support update_if_exists behavior?
+ */
Yes, I think that's needed - hence using bfm_set() instead of insert() in the
prototype.
+void
+radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
+{
+    int         shift;
+    bool        replaced;
+    radix_tree_node *node;
+    radix_tree_node *parent = tree->root;
+
+    /* Empty tree, create the root */
+    if (!tree->root)
+        radix_tree_new_root(tree, key, val);
+
+    /* Extend the tree if necessary */
+    if (key > tree->max_val)
+        radix_tree_extend(tree, key);
FWIW, the reason I used separate functions for these in the prototype is that
it turns out to generate a lot better code, because it allows non-inlined
function calls to be sibling calls - thereby avoiding the need for a dedicated
stack frame. That's not possible once you need a palloc or such, so splitting
off those call paths into dedicated functions is useful.
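Roughly this shape, as a hypothetical sketch (the helper name and the elided
bodies are illustrative, not taken from the prototype):

static pg_noinline void
radix_tree_insert_extend(radix_tree *tree, uint64 key, Datum val, bool *found_p)
{
	/* ... grow the tree to cover 'key', then finish the insert ... */
}

void
radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
{
	if (unlikely(key > tree->max_val))
	{
		/* sibling (tail) call: no dedicated stack frame needed here */
		radix_tree_insert_extend(tree, key, val, found_p);
		return;
	}

	/* ... common descend-and-insert path ... */
}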
Greetings,
Andres Freund
Hi,
On 2022-06-28 15:24:11 +0900, Masahiko Sawada wrote:
In both test cases, there is not much difference between using AVX2
and SSE2. The more node types, the more time it takes for loading the
data (see sse2_4_16_32_128_256).
Yea, at some point the compiler starts using a jump table instead of branches,
and that turns out to be a good bit more expensive. And even with branches, it
obviously adds hard to predict branches. IIRC I fought a bit with the compiler
to avoid some of that cost, it's possible that got "lost" in Sawada-san's
patch.
Sawada-san, what led you to discard the 1 and 16 node types? IIRC the 1 node
one is not unimportant until we have path compression.
Right now the node struct sizes are:
4 - 48 bytes
32 - 296 bytes
128 - 1304 bytes
256 - 2088 bytes
I guess radix_tree_node_128->isset is just 16 bytes compared to 1288 other
bytes, but needing that separate isset array somehow is sad :/. I wonder if a
smaller "free index" would do the trick? Point to the element + 1 where we
searched last and start a plain loop there. Particularly in an insert-only
workload that'll always work, and in other cases it'll still often work I
think.
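One possible reading of that, as a rough sketch: 'next_free' would be a new
one-byte field in node-128 (hypothetical, not in the patch), and the bitmap
test below is the same one the patch already performs:

static inline int
node_128_find_free_slot(radix_tree_node_128 *node)
{
	int		slotpos = node->next_free;

	/* the caller must already have checked that a free slot exists */
	while (node->isset[slotpos / 8] & (1 << (slotpos % 8)))
		slotpos = (slotpos + 1) % 128;

	node->next_free = (slotpos + 1) % 128;
	return slotpos;
}

In an insert-only workload the hint always points at a free slot, so the loop
body never executes.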
One thing I was wondering about is trying to choose node types in
roughly-power-of-two struct sizes. It's pretty easy to end up with significant
fragmentation in the slabs right now when inserting as you go, because some of
the smaller node types will be freed but not enough to actually free blocks of
memory. If we instead have ~power-of-two sizes we could just use a single slab
of the max size, and carve out the smaller node types out of that largest
allocation.
Btw, that fragmentation is another reason why I think it's better to track
memory usage via memory contexts, rather than doing so based on
GetMemoryChunkSpace().
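For example, a context-based rt_memory_usage() could simply ask the slab
contexts; this is a sketch using the per-kind contexts from the attached
patch, and it naturally accounts for fragmentation too:

uint64
rt_memory_usage(radix_tree *tree)
{
	Size	total = 0;

	for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
		total += MemoryContextMemAllocated(tree->slabs[i], true);

	return total;
}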
Ideally, node16 and node32 would have the same code with a different
loop count (1 or 2). More generally, there is too much duplication of
code (noted by Andres in his PoC), and there are many variable names
with the node size embedded. This is a bit tricky to make more
general, so we don't need to try it yet, but ideally we would have
something similar to:

switch (node->kind) // todo: inspect tagged pointer
{
case RADIX_TREE_NODE_KIND_4:
idx = node_search_eq(node, chunk, 4);
do_action(node, idx, 4, ...);
break;
case RADIX_TREE_NODE_KIND_32:
idx = node_search_eq(node, chunk, 32);
do_action(node, idx, 32, ...);
...
}
FWIW, that should be doable with an inline function, if you pass it the memory
to the "array" rather than the node directly. Not so sure it's a good idea to
do dispatch between node types / search methods inside the helper, as you
suggest below:
static pg_alwaysinline void
node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
{
if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
// do simple loop with (node_simple *) node;
else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
// do vectorized loop where available with (node_vec *) node;
...
}
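Something along these lines, sketched with illustrative names; the caller's
switch keeps the per-kind (and per-ISA) dispatch while node-4/16/32 share the
loop:

static inline int
chunk_array_search_eq(const uint8 *chunks, int count, uint8 chunk)
{
	for (int i = 0; i < count; i++)
	{
		if (chunks[i] > chunk)
			break;				/* chunks are sorted, no match possible */
		if (chunks[i] == chunk)
			return i;
	}
	return -1;
}

/* e.g. case RT_NODE_KIND_4:
 *          idx = chunk_array_search_eq(n4->chunks, n4->n.count, chunk);
 */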
Greetings,
Andres Freund
On Mon, Jul 4, 2022 at 2:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Jun 28, 2022 at 10:10 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jun 28, 2022 at 1:24 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I
suspect other optimizations would be worth a lot more than using AVX2:
- collapsing inner nodes
- taking care when constructing the key (more on this when we
integrate with VACUUM)
...and a couple Andres mentioned:
- memory management: in
/messages/by-id/20210717194333.mr5io3zup3kxahfm@alap3.anarazel.de
- node dispatch:
/messages/by-id/20210728184139.qhvx6nbwdcvo63m6@alap3.anarazel.de

Therefore, I would suggest that we use SSE2 only, because:
- portability is very easy
- to avoid a performance hit from indirecting through a function pointer

Okay, I'll try these optimizations and see if the performance becomes better.
FWIW, I think it's fine if we delay these until after committing a
good-enough version. The exception is key construction and I think
that deserves some attention now (more on this below).

Agreed.
I've done benchmark tests while changing the node types. The code base
is the v3 patch, which doesn't have the optimizations you mentioned below
(memory management and node dispatch), but I added code to use SSE2
for node-16 and node-32.

Great, this is helpful to visualize what's going on!
* sse2_4_16_48_256
* nkeys = 90910000, height = 3, n4 = 0, n16 = 0, n48 = 512, n256 = 916433
* nkeys = 20000, height = 3, n4 = 20000, n16 = 0, n48 = 207, n256 = 50

* sse2_4_32_128_256
* nkeys = 90910000, height = 3, n4 = 0, n32 = 285, n128 = 916629, n256 = 31
* nkeys = 20000, height = 3, n4 = 20000, n32 = 48, n128 = 208, n256 = 1

Observations are:
In both test cases, there is not much difference between using AVX2
and SSE2. The more node types, the more time it takes for loading the
data (see sse2_4_16_32_128_256).

Good to know. And as Andres mentioned in his PoC, more node types
would be a barrier for pointer tagging, since 32-bit platforms only
have two spare bits in the pointer.

In the dense case, since most nodes have around 100 children, the radix
tree that has node-128 had a good figure in terms of memory usage. On

Looking at the node stats, and then your benchmark code, I think key
construction is a major influence, maybe more than node type. The
key/value scheme tested now makes sense:

blockhi || blocklo || 9 bits of item offset
(with the leaf nodes containing a bit map of the lowest few bits of
this whole thing)

We want the lower fanout nodes at the top of the tree and higher
fanout ones at the bottom.

So more inner nodes can fit in CPU cache, right?
Note some consequences: If the table has enough columns such that much
fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
dense case the nodes above the leaves will have lower fanout (maybe
they will fit in a node32). Also, the bitmap values in the leaves will
be more empty. In other words, many tables in the wild *resemble* the
sparse case a bit, even if truly all tuples on the page are dead.

Note also that the dense case in the benchmark above has ~4500 times
more keys than the sparse case, and uses about ~1000 times more
memory. But the runtime is only 2-3 times longer. That's interesting
to me.

To optimize for the sparse case, it seems to me that the key/value would be
blockhi || 9 bits of item offset || blocklo
I believe that would make the leaf nodes more dense, with fewer inner
nodes, and could drastically speed up the sparse case, and maybe many
realistic dense cases.

Does it have an effect on the number of inner nodes?
I'm curious to hear your thoughts.
Thank you for your analysis. It's worth trying. We use 9 bits for item
offset but most pages don't use all bits in practice. So it might be
better to move the most significant bit of the item offset to the left of
blockhi. Or, more simply:

9 bits of item offset || blockhi || blocklo
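To make the layouts concrete, here is a rough sketch of two of the encodings
being discussed; the helper names and constants are illustrative, not from
the patch:

static inline uint64
tid_to_key_current(BlockNumber blkno, OffsetNumber off)
{
	/* blockhi || blocklo || 9 bits of item offset */
	return ((uint64) blkno << 9) | (off & 0x1FF);
}

static inline uint64
tid_to_key_offset_first(BlockNumber blkno, OffsetNumber off)
{
	/* 9 bits of item offset || blockhi || blocklo */
	return ((uint64) (off & 0x1FF) << 32) | blkno;
}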
the other hand, the radix tree that doesn't have node-128 has a better
number in terms of insertion performance. This is probably because we
need to iterate over 'isset' flags from the beginning of the array in
order to find an empty slot when inserting new data. We do the same
thing also for node-48 but it was better than node-128 as it's up to
48.

I mentioned in my diff, but for those following along, I think we can
improve that by iterating over the bytes and if it's 0xFF all 8 bits
are set already so keep looking...

Right. Using 0xFF also makes the code readable so I'll change that.
In terms of lookup performance, the results vary but I could not find
any common pattern that makes the performance better or worse. Getting
more statistics such as the number of each node type per tree level
might help me.

I think that's a sign that the choice of node types might not be
terribly important for these two cases. That's good if that's true in
general -- a future performance-critical use of this code might tweak
things for itself without upsetting vacuum.

Agreed.
I've attached an updated patch that incorporates the comments from John.
Here are some comments I could not address, and the reasons:
+// bitfield is uint32, so we don't need UINT64_C
bitfield &= ((UINT64_C(1) << node->n.count) - 1);
Since node->n.count could be 32, I think we need to use UINT64CONST() here.
/* Macros for radix tree nodes */
+// not sure why are we doing casts here?
#define IS_LEAF_NODE(n) (((radix_tree_node *) (n))->shift == 0)
#define IS_EMPTY_NODE(n) (((radix_tree_node *) (n))->count == 0)
I've left the casts as I use IS_LEAF_NODE for rt_node_4/16/32/128/256.
Also, I've dropped the configure script support for AVX2, and support
for SSE2 is missing. I'll update it later.
I've not addressed the comments I got from Andres yet, so I'll update
the patch according to the discussion, but the current patch should be
more readable than the previous one thanks to the comments from John.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachments:
radixtree_wip_v4.patch
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..ead0755d25 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,9 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
+radixtree.o: CFLAGS+=-msse2
+
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..f1118679d6
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2040 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instructions, enabling us to use 256-bit
+ * wide SIMD vectors, whereas 128-bit wide SIMD vectors are used in the paper.
+ * Also, there is no support for path compression and lazy path expansion. The
+ * radix tree supports a fixed-length key, so we don't expect the tree to
+ * become very high.
+ *
+ * The key is a 64-bit unsigned integer and the value is a Datum. Both internal
+ * nodes and leaf nodes have the identical structure. Internal tree nodes
+ * (shift > 0) store pointers to their child nodes as the values. Leaf nodes
+ * (shift == 0) store the Datum values specified by the user. The
+ * paper refers to this technique as "Multi-value leaves". We choose it for
+ * simplicity and to avoid an additional pointer traversal. It is the reason
+ * this code currently does not support variable-length keys.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_insert - Insert a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iterate - End iteration
+ *
+ * rt_create() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for each kind of radix tree node.
+ *
+ * rt_iterate_next() returns the key-value pairs in ascending order of the
+ * key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+#if defined(__SSE2__)
+#include <emmintrin.h> /* SSE2 intrinsics */
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes required for an is-set bitmap covering nslots
+ * slots; used by the node kinds that track slot usage with a bitmap.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) \
+ ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from the value to the bit in is-set bitmap in the node-128
+ * and node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree nodes.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation
+ * a smaller class should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 48 -> 152 -> 296 -> 1304 -> 2088 bytes for inner/leaf nodes, leading to
+ * large amounts of allocator padding with aset.c. Hence the use of slab.
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+typedef enum rt_node_kind
+{
+ RT_NODE_KIND_4 = 0,
+ RT_NODE_KIND_16,
+ RT_NODE_KIND_32,
+ RT_NODE_KIND_128,
+ RT_NODE_KIND_256
+} rt_node_kind;
+#define RT_NODE_KIND_COUNT 5
+
+/*
+ * Base type for all nodes types.
+ */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size class of the node */
+ rt_node_kind kind;
+} rt_node;
+
+/* Macros for radix tree nodes */
+#define IS_LEAF_NODE(n) (((rt_node *) (n))->shift == 0)
+#define IS_EMPTY_NODE(n) (((rt_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((rt_node *) (n))->count < rt_node_info[((rt_node *) (n))->kind].max_slots)
+
+/*
+ * To reduce memory usage compared to a simple radix tree with a fixed
+ * fanout, we use adaptive node sizes, with different storage methods
+ * for different numbers of elements.
+ */
+typedef struct rt_node_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+ Datum slots[4];
+} rt_node_4;
+
+typedef struct rt_node_16
+{
+ rt_node n;
+
+ /* 16 children, for key chunks */
+ uint8 chunks[16];
+ Datum slots[16];
+} rt_node_16;
+
+typedef struct rt_node_32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+ Datum slots[32];
+} rt_node_32;
+
+#define RT_NODE_128_INVALID_IDX 0xFF
+typedef struct rt_node_128
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /*
+ * Slots for 128 children.
+ *
+ * Since the rt_node_xxx node is used by both inner and leaf nodes,
+ * we need to distinguish between a null pointer in inner nodes and
+ * a (Datum) 0 value in leaf nodes. isset is a bitmap to track which
+ * slot is in use.
+ */
+ Datum slots[128];
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+} rt_node_128;
+
+typedef struct rt_node_256
+{
+ rt_node n;
+
+ /*
+ * Slots for 256 children. The isset is a bitmap to track which slot
+ * is in use.
+ */
+ Datum slots[RT_NODE_MAX_SLOTS];
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+} rt_node_256;
+
+/* Information of each size class */
+typedef struct rt_node_info_elem
+{
+ const char *name;
+ int max_slots;
+ Size size;
+} rt_node_info_elem;
+
+static rt_node_info_elem rt_node_info[] =
+{
+ {"radix tree node 4", 4, sizeof(rt_node_4)},
+ {"radix tree node 16", 16, sizeof(rt_node_16)},
+ {"radix tree node 32", 32, sizeof(rt_node_32)},
+ {"radix tree node 128", 128, sizeof(rt_node_128)},
+ {"radix tree node 256", 256, sizeof(rt_node_256)},
+};
+
+/*
+ * The data structure for stacking the radix tree nodes.
+ *
+ * When deleting a key-value pair, we descend the radix tree, pushing the
+ * visited inner nodes. The stack can be freed by using rt_free_stack.
+ */
+typedef struct rt_stack_data
+{
+ rt_node *node;
+ struct rt_stack_data *parent;
+} rt_stack_data;
+typedef rt_stack_data *rt_stack;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending order
+ * of the key. To support this, we iterate over nodes at each level.
+ * rt_iter_node_data struct is used to track the iteration within a node.
+ * rt_iter has the array of this struct, stack, in order to track the iteration
+ * of every level. During the iteration, we also construct the key to return. The key
+ * is updated whenever we update the node iteration information, e.g., when advancing
+ * the current index within the node or when moving to the next node at the same level.
+ */
+typedef struct rt_iter_node_data
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_iter_node_data;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_iter_node_data stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+ MemoryContextData *slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+ uint64 mem_used;
+ int32 cnt[RT_NODE_KIND_COUNT];
+};
+
+static rt_node *rt_node_grow(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key);
+static bool rt_node_find_child(rt_node *node, rt_node **child_p, uint64 key);
+static bool rt_node_search(rt_node *node, Datum **slot_p, uint64 key,
+ rt_action action);
+static void rt_extend(radix_tree *tree, uint64 key);
+static void rt_new_root(radix_tree *tree, uint64 key, Datum val);
+static rt_node *rt_node_insert_child(radix_tree *tree,
+ rt_node *parent,
+ rt_node *node,
+ uint64 key);
+static void rt_node_insert_val(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key, Datum val,
+ bool *replaced_p);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+static Datum rt_node_iterate_next(rt_iter *iter, rt_iter_node_data *node_iter,
+ bool *found_p);
+static void rt_store_iter_node(rt_iter *iter, rt_iter_node_data *node_iter,
+ rt_node *node);
+static void rt_update_iter_stack(rt_iter *iter, int from);
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Helper functions for accessing each kind of nodes.
+ */
+
+static inline int
+node_16_search_eq(rt_node_16 *node, uint8 chunk)
+{
+/*
+ * On Windows, even if we use SSE intrinsics, pg_rightmost_one_pos32 is slow.
+ * So we guard with HAVE__BUILTIN_CTZ as well.
+ *
+ * XXX: once we have the correct interfaces to pg_bitutils.h for Windows
+ * we can remove the HAVE__BUILTIN_CTZ condition.
+ */
+#if defined(__SSE2__) && defined(HAVE__BUILTIN_CTZ)
+ __m128i key_v = _mm_set1_epi8(chunk);
+ __m128i data_v = _mm_loadu_si128((__m128i_u *) node->chunks);
+ __m128i cmp_v = _mm_cmpeq_epi8(key_v, data_v);
+ uint32 bitfield = _mm_movemask_epi8(cmp_v);
+
+ bitfield &= ((1 << node->n.count) - 1);
+
+ return bitfield ? pg_rightmost_one_pos32(bitfield) : -1;
+#else
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] > chunk)
+ return -1;
+
+ if (node->chunks[i] == chunk)
+ return i;
+ }
+
+ return -1;
+#endif
+}
+
+/*
+ * This is a bit more complicated than node_16_search_eq(), because
+ * until recently no unsigned uint8 comparison instruction existed on x86. So
+ * we need to play some trickery using _mm_min_epu8() to effectively get
+ * <=. There never will be any equal elements in the current uses, but that's
+ * what we get here...
+ */
+static inline int
+node_16_search_le(rt_node_16 *node, uint8 chunk)
+{
+#if defined(__SSE2__) && defined(HAVE__BUILTIN_CTZ)
+ __m128i key_v = _mm_set1_epi8(chunk);
+ __m128i data_v = _mm_loadu_si128((__m128i_u *) node->chunks);
+ __m128i min_v = _mm_min_epu8(data_v, key_v);
+ __m128i cmp_v = _mm_cmpeq_epi8(key_v, min_v);
+ uint32 bitfield = _mm_movemask_epi8(cmp_v);
+
+ bitfield &= ((1 << node->n.count) - 1);
+
+ return (bitfield) ? pg_rightmost_one_pos32(bitfield) : node->n.count;
+#else
+ int index;
+
+ for (index = 0; index < node->n.count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+
+ return index;
+#endif
+}
+
+static inline int
+node_32_search_eq(rt_node_32 *node, uint8 chunk)
+{
+#if defined(__SSE2__) && defined(HAVE__BUILTIN_CTZ)
+ int index = 0;
+ __m128i key_v = _mm_set1_epi8(chunk);
+
+ while (index < node->n.count)
+ {
+ __m128i data_v = _mm_loadu_si128((__m128i_u *) &(node->chunks[index]));
+ __m128i cmp_v = _mm_cmpeq_epi8(key_v, data_v);
+ uint32 bitfield = _mm_movemask_epi8(cmp_v);
+
+ bitfield &= ((UINT64CONST(1) << node->n.count) - 1);
+
+ if (bitfield)
+ {
+ index += pg_rightmost_one_pos32(bitfield);
+ break;
+ }
+
+ index += 16;
+ }
+
+ return (index < node->n.count) ? index : -1;
+#else
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] > chunk)
+ return -1;
+
+ if (node->chunks[i] == chunk)
+ return i;
+ }
+
+ return -1;
+#endif
+}
+
+/*
+ * Similar to node_16_search_le we need to play some trickery using
+ * _mm_min_epu8() to effectively get <=. There never will be any equal elements
+ * in the current uses, but that's what we get here...
+ */
+static inline int
+node_32_search_le(rt_node_32 *node, uint8 chunk)
+{
+#if defined(__SSE2__) && defined(HAVE__BUILTIN_CTZ)
+ int index = 0;
+ bool found = false;
+ __m128i key_v = _mm_set1_epi8(chunk);
+
+ while (index < node->n.count)
+ {
+ __m128i data_v = _mm_loadu_si128((__m128i_u *) &(node->chunks[index]));
+ __m128i min_v = _mm_min_epu8(data_v, key_v);
+ __m128i cmp_v = _mm_cmpeq_epi8(key_v, min_v);
+ uint32 bitfield = _mm_movemask_epi8(cmp_v);
+
+ bitfield &= ((UINT64CONST(1) << node->n.count)-1);
+
+ if (bitfield)
+ {
+ index += pg_rightmost_one_pos32(bitfield);
+ found = true;
+ break;
+ }
+
+ index += 16;
+ }
+
+ return found ? index : node->n.count;
+#else
+ int index;
+
+ for (index = 0; index < node->n.count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+
+ return index;
+#endif
+}
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(rt_node_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_128_is_slot_used(rt_node_128 *node, uint8 slot)
+{
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_128_set(rt_node_128 *node, uint8 chunk, Datum val)
+{
+ int slotpos;
+
+ /*
+ * Find an unused slot. We iterate over the isset bitmap per byte
+ * then check each bit.
+ */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ node->slot_idxs[chunk] = slotpos;
+ node->slots[slotpos] = val;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+}
+
+/* Delete the slot at the corresponding chunk */
+static inline void
+node_128_unset(rt_node_128 *node, uint8 chunk)
+{
+ int slotpos = node->slot_idxs[chunk];
+
+ if (!node_128_is_chunk_used(node, chunk))
+ return;
+
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Return the slot data corresponding to the chunk */
+static inline Datum
+node_128_get_chunk_slot(rt_node_128 *node, uint8 chunk)
+{
+ return node->slots[node->slot_idxs[chunk]];
+}
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_256_is_chunk_used(rt_node_256 *node, uint8 chunk)
+{
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+node_256_set(rt_node_256 *node, uint8 chunk, Datum slot)
+{
+ node->slots[chunk] = slot;
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+node_256_unset(rt_node_256 *node, uint8 chunk)
+{
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift that suffices to store the given key.
+ */
+inline static int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_node_kind kind)
+{
+ rt_node *newnode;
+
+ newnode = (rt_node *) MemoryContextAllocZero(tree->slabs[kind],
+ rt_node_info[kind].size);
+ newnode->kind = kind;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_128 *n128 = (rt_node_128 *) newnode;
+
+ memset(&(n128->slot_idxs), RT_NODE_128_INVALID_IDX,
+ sizeof(n128->slot_idxs));
+ }
+
+ /* update the statistics */
+ tree->mem_used += GetMemoryChunkSpace(newnode);
+ tree->cnt[kind]++;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ tree->root = NULL;
+
+ /* update the statistics */
+ tree->mem_used -= GetMemoryChunkSpace(node);
+ tree->cnt[node->kind]--;
+
+ Assert(tree->mem_used >= 0);
+ Assert(tree->cnt[node->kind] >= 0);
+
+ pfree(node);
+}
+
+/* Free a stack made by rt_delete */
+static void
+rt_free_stack(rt_stack stack)
+{
+ rt_stack ostack;
+
+ while (stack != NULL)
+ {
+ ostack = stack;
+ stack = stack->parent;
+ pfree(ostack);
+ }
+}
+
+/* Copy the common fields without the kind */
+static void
+rt_copy_node_common(rt_node *src, rt_node *dst)
+{
+ dst->shift = src->shift;
+ dst->chunk = src->chunk;
+ dst->count = src->count;
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_4 *node =
+ (rt_node_4 *) rt_alloc_node(tree, RT_NODE_KIND_4);
+
+ node->n.count = 1;
+ node->n.shift = shift;
+ node->chunks[0] = 0;
+ node->slots[0] = PointerGetDatum(tree->root);
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * Wrapper for rt_node_search to search the pointer to the child node in the
+ * node.
+ *
+ * Return true if the corresponding child is found, otherwise return false. On success,
+ * it sets child_p.
+ */
+static bool
+rt_node_find_child(rt_node *node, rt_node **child_p, uint64 key)
+{
+ bool found = false;
+ Datum *slot_ptr;
+
+ if (rt_node_search(node, &slot_ptr, key, RT_ACTION_FIND))
+ {
+ /* Found the pointer to the child node */
+ found = true;
+ *child_p = (rt_node *) DatumGetPointer(*slot_ptr);
+ }
+
+ return found;
+}
+
+/*
+ * Return true if the corresponding slot is used, otherwise return false. On success,
+ * sets the pointer to the slot to slot_p.
+ */
+static bool
+rt_node_search(rt_node *node, Datum **slot_p, uint64 key,
+ rt_action action)
+{
+ int chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_4 *n4 = (rt_node_4 *) node;
+
+ /* Do linear search */
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ if (n4->chunks[i] > chunk)
+ break;
+
+ /*
+ * If we find the chunk in the node, do the specified
+ * action
+ */
+ if (n4->chunks[i] == chunk)
+ {
+ if (action == RT_ACTION_FIND)
+ *slot_p = &(n4->slots[i]);
+ else /* RT_ACTION_DELETE */
+ {
+ memmove(&(n4->chunks[i]), &(n4->chunks[i + 1]),
+ sizeof(uint8) * (n4->n.count - i - 1));
+ memmove(&(n4->slots[i]), &(n4->slots[i + 1]),
+ sizeof(rt_node *) * (n4->n.count - i - 1));
+ }
+
+ found = true;
+ break;
+ }
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_16:
+ {
+ rt_node_16 *n16 = (rt_node_16 *) node;
+ int idx;
+
+ /* Search by SIMD instructions */
+ idx = node_16_search_eq(n16, chunk);
+
+ /* If we find the chunk in the node, do the specified action */
+ if (idx >= 0)
+ {
+ if (action == RT_ACTION_FIND)
+ *slot_p = &(n16->slots[idx]);
+ else /* RT_ACTION_DELETE */
+ {
+ memmove(&(n16->chunks[idx]), &(n16->chunks[idx + 1]),
+ sizeof(uint8) * (n16->n.count - idx - 1));
+ memmove(&(n16->slots[idx]), &(n16->slots[idx + 1]),
+ sizeof(rt_node *) * (n16->n.count - idx - 1));
+ }
+
+ found = true;
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_32 *n32 = (rt_node_32 *) node;
+ int idx;
+
+ /* Search by SIMD instructions */
+ idx = node_32_search_eq(n32, chunk);
+
+ /* If we find the chunk in the node, do the specified action */
+ if (idx >= 0)
+ {
+ if (action == RT_ACTION_FIND)
+ *slot_p = &(n32->slots[idx]);
+ else /* RT_ACTION_DELETE */
+ {
+ memmove(&(n32->chunks[idx]), &(n32->chunks[idx + 1]),
+ sizeof(uint8) * (n32->n.count - idx - 1));
+ memmove(&(n32->slots[idx]), &(n32->slots[idx + 1]),
+ sizeof(rt_node *) * (n32->n.count - idx - 1));
+ }
+
+ found = true;
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_128 *n128 = (rt_node_128 *) node;
+
+ /* If we find the chunk in the node, do the specified action */
+ if (node_128_is_chunk_used(n128, chunk))
+ {
+ if (action == RT_ACTION_FIND)
+ *slot_p = &(n128->slots[n128->slot_idxs[chunk]]);
+ else /* RT_ACTION_DELETE */
+ node_128_unset(n128, chunk);
+
+ found = true;
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_256 *n256 = (rt_node_256 *) node;
+
+ /* If we find the chunk in the node, do the specified action */
+ if (node_256_is_chunk_used(n256, chunk))
+ {
+ if (action == RT_ACTION_FIND)
+ *slot_p = &(n256->slots[chunk]);
+ else /* RT_ACTION_DELETE */
+ node_256_unset(n256, chunk);
+
+ found = true;
+ }
+
+ break;
+ }
+ }
+
+ /* Update the statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ return found;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key, Datum val)
+{
+ rt_node_4 *n4 =
+ (rt_node_4 *) rt_alloc_node(tree, RT_NODE_KIND_4);
+ int shift = key_get_shift(key);
+
+ n4->n.shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = (rt_node *) n4;
+}
+
+/* Insert 'node' as a child node of 'parent' */
+static rt_node *
+rt_node_insert_child(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key)
+{
+ rt_node *newchild =
+ (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4);
+
+ Assert(!IS_LEAF_NODE(node));
+
+ newchild->shift = node->shift - RT_NODE_SPAN;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+ rt_node_insert_val(tree, parent, node, key, PointerGetDatum(newchild), NULL);
+
+ return (rt_node *) newchild;
+}
+
+/*
+ * Insert the value to the node. The node grows if it's full.
+ */
+static void
+rt_node_insert_val(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key, Datum val,
+ bool *replaced_p)
+{
+ int chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool replaced = false;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_4 *n4 = (rt_node_4 *) node;
+ int idx;
+
+ for (idx = 0; idx < n4->n.count; idx++)
+ {
+ if (n4->chunks[idx] >= chunk)
+ break;
+ }
+
+ if (NODE_HAS_FREE_SLOT(n4))
+ {
+ if (n4->n.count == 0)
+ {
+ /* the first key for this node, add it */
+ }
+ else if (n4->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n4->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the
+ * array, make space for the new key.
+ */
+ memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
+ sizeof(uint8) * (n4->n.count - idx));
+ memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
+ sizeof(Datum) * (n4->n.count - idx));
+ }
+
+ n4->chunks[idx] = chunk;
+ n4->slots[idx] = val;
+
+ /* Done */
+ break;
+ }
+
+ /* The node doesn't have free slot so needs to grow */
+ node = rt_node_grow(tree, parent, node, key);
+ Assert(node->kind == RT_NODE_KIND_16);
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_16:
+ {
+ rt_node_16 *n16 = (rt_node_16 *) node;
+ int idx;
+
+ idx = node_16_search_le(n16, chunk);
+
+ if (NODE_HAS_FREE_SLOT(n16))
+ {
+ if (n16->n.count == 0)
+ {
+ /* first key for this node, add it */
+ }
+ else if (n16->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n16->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the
+ * array, make space for the new key.
+ */
+ memmove(&(n16->chunks[idx + 1]), &(n16->chunks[idx]),
+ sizeof(uint8) * (n16->n.count - idx));
+ memmove(&(n16->slots[idx + 1]), &(n16->slots[idx]),
+ sizeof(Datum) * (n16->n.count - idx));
+ }
+
+ n16->chunks[idx] = chunk;
+ n16->slots[idx] = val;
+
+ /* Done */
+ break;
+ }
+
+ /* The node doesn't have free slot so needs to grow */
+ node = rt_node_grow(tree, parent, node, key);
+ Assert(node->kind == RT_NODE_KIND_32);
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_32 *n32 = (rt_node_32 *) node;
+ int idx;
+
+ idx = node_32_search_le(n32, chunk);
+
+ if (NODE_HAS_FREE_SLOT(n32))
+ {
+ if (n32->n.count == 0)
+ {
+ /* first key for this node, add it */
+ }
+ else if (n32->chunks[idx] == chunk)
+ {
+ /* found the key, replace it */
+ replaced = true;
+ }
+ else if (idx != n32->n.count)
+ {
+ /*
+ * the key needs to be inserted in the middle of the
+ * array, make space for the new key.
+ */
+ memmove(&(n32->chunks[idx + 1]), &(n32->chunks[idx]),
+ sizeof(uint8) * (n32->n.count - idx));
+ memmove(&(n32->slots[idx + 1]), &(n32->slots[idx]),
+ sizeof(Datum) * (n32->n.count - idx));
+ }
+
+ n32->chunks[idx] = chunk;
+ n32->slots[idx] = val;
+
+ /* Done */
+ break;
+ }
+
+ /* The node doesn't have free slot so needs to grow */
+ node = rt_node_grow(tree, parent, node, key);
+ Assert(node->kind == RT_NODE_KIND_128);
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_128 *n128 = (rt_node_128 *) node;
+
+ if (node_128_is_chunk_used(n128, chunk))
+ {
+ /* found the existing value */
+ node_128_set(n128, chunk, val);
+ replaced = true;
+ break;
+ }
+
+ if (NODE_HAS_FREE_SLOT(n128))
+ {
+ node_128_set(n128, chunk, val);
+
+ /* Done */
+ break;
+ }
+
+ /* The node doesn't have free slot so needs to grow */
+ node = rt_node_grow(tree, parent, node, key);
+ Assert(node->kind == RT_NODE_KIND_256);
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_256 *n256 = (rt_node_256 *) node;
+
+ if (node_256_is_chunk_used(n256, chunk))
+ replaced = true;
+
+ node_256_set(n256, chunk, val);
+
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!replaced)
+ node->count++;
+
+ if (replaced_p)
+ *replaced_p = replaced;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+}
+
+/* Change the node type to the next larger one */
+static rt_node *
+rt_node_grow(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key)
+{
+ rt_node *newnode = NULL;
+
+ Assert(node->count == rt_node_info[node->kind].max_slots);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_4 *n4 = (rt_node_4 *) node;
+ rt_node_16 *new16 =
+ (rt_node_16 *) rt_alloc_node(tree, RT_NODE_KIND_16);
+
+ rt_copy_node_common((rt_node *) n4,
+ (rt_node *) new16);
+
+ /* Copy both chunks and slots to the new node */
+ memcpy(&(new16->chunks), &(n4->chunks), sizeof(uint8) * 4);
+ memcpy(&(new16->slots), &(n4->slots), sizeof(Datum) * 4);
+
+ newnode = (rt_node *) new16;
+ break;
+ }
+ case RT_NODE_KIND_16:
+ {
+ rt_node_16 *n16 = (rt_node_16 *) node;
+ rt_node_32 *new32 =
+ (rt_node_32 *) rt_alloc_node(tree, RT_NODE_KIND_32);
+
+ rt_copy_node_common((rt_node *) n16,
+ (rt_node *) new32);
+
+ /* Copy both chunks and slots to the new node */
+ memcpy(&(new32->chunks), &(n16->chunks), sizeof(uint8) * 16);
+ memcpy(&(new32->slots), &(n16->slots), sizeof(Datum) * 16);
+
+ newnode = (rt_node *) new32;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_32 *n32 = (rt_node_32 *) node;
+ rt_node_128 *new128 =
+ (rt_node_128 *) rt_alloc_node(tree, RT_NODE_KIND_128);
+
+ /* Copy both chunks and slots to the new node */
+ rt_copy_node_common((rt_node *) n32,
+ (rt_node *) new128);
+
+ for (int i = 0; i < n32->n.count; i++)
+ node_128_set(new128, n32->chunks[i], n32->slots[i]);
+
+ newnode = (rt_node *) new128;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_128 *n128 = (rt_node_128 *) node;
+ rt_node_256 *new256 =
+ (rt_node_256 *) rt_alloc_node(tree, RT_NODE_KIND_256);
+ int cnt = 0;
+
+ rt_copy_node_common((rt_node *) n128,
+ (rt_node *) new256);
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->n.count; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ node_256_set(new256, i, node_128_get_chunk_slot(n128, i));
+ cnt++;
+ }
+
+ newnode = (rt_node *) new256;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ elog(ERROR, "radix tree node-256 cannot grow");
+ break;
+ }
+
+ if (parent == node)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = newnode;
+ }
+ else
+ {
+ Datum *slot_ptr = NULL;
+
+ /* Redirect from the parent to the node */
+ rt_node_search(parent, &slot_ptr, key, RT_ACTION_FIND);
+ Assert(*slot_ptr);
+ *slot_ptr = PointerGetDatum(newnode);
+ }
+
+ /* Verify the node has grown properly */
+ rt_verify_node(newnode);
+
+ /* Free the old node */
+ rt_free_node(tree, node);
+
+ return newnode;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+ tree->mem_used = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->slabs[i] = SlabContextCreate(ctx,
+ rt_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ rt_node_info[i].size);
+ tree->cnt[i] = 0;
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ MemoryContextDelete(tree->slabs[i]);
+
+ pfree(tree);
+}
+
+/*
+ * Insert the key with the val.
+ *
+ * If found_p is not NULL, it is set to true if the key is already present,
+ * otherwise to false.
+ *
+ * XXX: do we need to support update_if_exists behavior?
+ */
+void
+rt_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
+{
+ int shift;
+ bool replaced;
+ rt_node *node;
+ rt_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key, val);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ if (!rt_node_find_child(node, &child, key))
+ child = rt_node_insert_child(tree, parent, node, key);
+
+ Assert(child != NULL);
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* arrived at a leaf */
+ Assert(IS_LEAF_NODE(node));
+
+ rt_node_insert_val(tree, parent, node, key, val, &replaced);
+
+ /* Update the statistics */
+ if (!replaced)
+ tree->num_keys++;
+
+ if (found_p)
+ *found_p = replaced;
+}
+
+/*
+ * Search the given key in the radix tree. Return true if the key is successfully
+ * found, otherwise return false. On success, we store the value in *val_p, so
+ * val_p must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, Datum *val_p)
+{
+ rt_node *node;
+ Datum *value_ptr;
+ int shift;
+
+ Assert(val_p);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ if (!rt_node_find_child(node, &child, key))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* We reached a leaf node, search the corresponding slot */
+ Assert(IS_LEAF_NODE(node));
+
+ if (!rt_node_search(node, &value_ptr, key, RT_ACTION_FIND))
+ return false;
+
+ /* Found, set the value to return */
+ *val_p = *value_ptr;
+ return true;
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ rt_stack stack = NULL;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search for the key, while building a stack of the
+ * nodes we visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+ rt_stack new_stack;
+
+ new_stack = (rt_stack) palloc(sizeof(rt_stack_data));
+ new_stack->node = node;
+ new_stack->parent = stack;
+ stack = new_stack;
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ if (!rt_node_find_child(node, &child, key))
+ {
+ rt_free_stack(stack);
+ return false;
+ }
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /*
+ * Delete the key from the leaf node and recursively delete internal nodes
+ * if necessary.
+ */
+ Assert(IS_LEAF_NODE(stack->node));
+ while (stack != NULL)
+ {
+ rt_node *node;
+ Datum *slot;
+
+ /* pop the node from the stack */
+ node = stack->node;
+ stack = stack->parent;
+
+ deleted = rt_node_search(node, &slot, key, RT_ACTION_DELETE);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!IS_EMPTY_NODE(node))
+ break;
+
+ Assert(deleted);
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+
+ /*
+ * If we eventually deleted the root node while recursively deleting
+ * empty nodes, we make the tree empty.
+ */
+ if (stack == NULL)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+ }
+
+ if (deleted)
+ tree->num_keys--;
+
+ rt_free_stack(stack);
+ return deleted;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+
+ iter->stack_len = top_level;
+ iter->stack[top_level].node = iter->tree->root;
+ iter->stack[top_level].current_idx = -1;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is being
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update the stack of the radix tree node while descending to the leaf from
+ * the 'from' level.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, int from)
+{
+ rt_node *node = iter->stack[from].node;
+ int level = from;
+
+ for (;;)
+ {
+ rt_iter_node_data *node_iter = &(iter->stack[level--]);
+ bool found;
+
+ /* Set the node to this level */
+ rt_store_iter_node(iter, node_iter, node);
+
+ /* Finish if we reached the leaf node */
+ if (IS_LEAF_NODE(node))
+ break;
+
+ /* Advance to the next slot in the node */
+ node = (rt_node *)
+ DatumGetPointer(rt_node_iterate_next(iter, node_iter, &found));
+
+ /*
+ * Since we always get the first slot in the node, we must find
+ * the slot.
+ */
+ Assert(found);
+ }
+}
+
+/*
+ * Return true and set key_p and value_p if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, Datum *value_p)
+{
+ bool found = false;
+ Datum slot = (Datum) 0;
+
+ /* Empty tree */
+ if (!iter->tree)
+ return false;
+
+ for (;;)
+ {
+ rt_node *node;
+ rt_iter_node_data *node_iter;
+ int level;
+
+ /*
+ * Iterate over the node at each level, from the bottom of the tree, i.e.,
+ * the leaf node, until we find the next slot.
+ */
+ for (level = 0; level <= iter->stack_len; level++)
+ {
+ slot = rt_node_iterate_next(iter, &(iter->stack[level]), &found);
+
+ if (found)
+ break;
+ }
+
+ /* We could not find any new key-value pair, the iteration finished */
+ if (!found)
+ break;
+
+ /* found the next slot at the leaf node, return it */
+ if (level == 0)
+ {
+ *key_p = iter->key;
+ *value_p = slot;
+ break;
+ }
+
+ /*
+ * We have advanced the slots of more than one node, including both the leaf
+ * node and internal nodes. So we update the stack by descending to the
+ * leftmost leaf node from this level.
+ */
+ node = (rt_node *) DatumGetPointer(slot);
+ node_iter = &(iter->stack[level - 1]);
+ rt_store_iter_node(iter, node_iter, node);
+ rt_update_iter_stack(iter, level - 1);
+ }
+
+ return found;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Iterate over the given radix tree node and return its next slot, setting
+ * *found_p to true, if any. Otherwise, set *found_p to false.
+ */
+static Datum
+rt_node_iterate_next(rt_iter *iter, rt_iter_node_data *node_iter, bool *found_p)
+{
+ rt_node *node = node_iter->node;
+ Datum slot = (Datum) 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_4 *n4 = (rt_node_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n4->n.count)
+ goto not_found;
+
+ slot = n4->slots[node_iter->current_idx];
+
+ /* Update the part of the key by the current chunk */
+ if (IS_LEAF_NODE(n4))
+ rt_iter_update_key(iter, n4->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RT_NODE_KIND_16:
+ {
+ rt_node_16 *n16 = (rt_node_16 *) node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n16->n.count)
+ goto not_found;
+
+ slot = n16->slots[node_iter->current_idx];
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n16))
+ rt_iter_update_key(iter, n16->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_32 *n32 = (rt_node_32 *) node;
+
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= n32->n.count)
+ goto not_found;
+
+ slot = n32->slots[node_iter->current_idx];
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n32))
+ rt_iter_update_key(iter, n32->chunks[node_iter->current_idx], 0);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_128 *n128 = (rt_node_128 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_128_is_chunk_used(n128, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = node_128_get_chunk_slot(n128, i);
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n128))
+ rt_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_256 *n256 = (rt_node_256 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = n256->slots[i];
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n256))
+ rt_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ }
+
+ *found_p = true;
+ return slot;
+
+not_found:
+ *found_p = false;
+ return (Datum) 0;
+}
+
+/*
+ * Initialize and update the node iteration struct with the given radix tree
+ * node. This function also updates the part of the key by the chunk of the
+ * given node.
+ */
+static void
+rt_store_iter_node(rt_iter *iter, rt_iter_node_data *node_iter,
+ rt_node *node)
+{
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ rt_iter_update_key(iter, node->chunk, node->shift + RT_NODE_SPAN);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the statistics of the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ return tree->mem_used;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_4 *n4 = (rt_node_4 *) node;
+
+ /* Check if the chunks in the node are sorted */
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_16:
+ {
+ rt_node_16 *n16 = (rt_node_16 *) node;
+
+ /* Check if the chunks in the node are sorted */
+ for (int i = 1; i < n16->n.count; i++)
+ Assert(n16->chunks[i - 1] < n16->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_32 *n32 = (rt_node_32 *) node;
+
+ /* Check if the chunks in the node are sorted */
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_128 *n128 = (rt_node_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(node_128_is_slot_used(n128, n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_256 *n256 = (rt_node_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check if the number of used chunks matches */
+ Assert(n256->n.count == cnt);
+
+ break;
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u(%lu), n16 = %u(%lu), n32 = %u(%lu), n128 = %u(%lu), n256 = %u(%lu)\n",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0], tree->cnt[0] * sizeof(rt_node_4),
+ tree->cnt[1], tree->cnt[1] * sizeof(rt_node_16),
+ tree->cnt[2], tree->cnt[2] * sizeof(rt_node_32),
+ tree->cnt[3], tree->cnt[3] * sizeof(rt_node_128),
+ tree->cnt[4], tree->cnt[4] * sizeof(rt_node_256));
+ /* rt_dump(tree); */
+}
+
+static void
+rt_print_slot(StringInfo buf, uint8 chunk, Datum slot, int idx, bool is_leaf, int level)
+{
+ char space[128] = {0};
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ if (is_leaf)
+ appendStringInfo(buf, "%s[%d] \"0x%X\" val(0x%lX) LEAF\n",
+ space,
+ idx,
+ chunk,
+ DatumGetInt64(slot));
+ else
+ appendStringInfo(buf, "%s[%d] \"0x%X\" -> ",
+ space,
+ idx,
+ chunk);
+}
+
+static void
+rt_dump_node(rt_node *node, int level, StringInfo buf, bool recurse)
+{
+ bool is_leaf = IS_LEAF_NODE(node);
+
+ appendStringInfo(buf, "[\"%s\" type %d, cnt %u, shift %u, chunk \"0x%X\"] chunks:\n",
+ IS_LEAF_NODE(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_4 *n4 = (rt_node_4 *) node;
+
+ for (int i = 0; i < n4->n.count; i++)
+ {
+ rt_print_slot(buf, n4->chunks[i], n4->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) n4->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_16:
+ {
+ rt_node_16 *n16 = (rt_node_16 *) node;
+
+ for (int i = 0; i < n16->n.count; i++)
+ {
+ rt_print_slot(buf, n16->chunks[i], n16->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) n16->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_32 *n32 = (rt_node_32 *) node;
+
+ for (int i = 0; i < n32->n.count; i++)
+ {
+ rt_print_slot(buf, n32->chunks[i], n32->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) n32->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_128 *n128 = (rt_node_128 *) node;
+
+ for (int j = 0; j < 256; j++)
+ {
+ if (!node_128_is_chunk_used(n128, j))
+ continue;
+
+ appendStringInfo(buf, "slot_idxs[%d]=%d, ", j, n128->slot_idxs[j]);
+ }
+ appendStringInfo(buf, "\nisset-bitmap:");
+ for (int j = 0; j < 16; j++)
+ {
+ appendStringInfo(buf, "%X ", (uint8) n128->isset[j]);
+ }
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ rt_print_slot(buf, i, node_128_get_chunk_slot(n128, i),
+ i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) node_128_get_chunk_slot(n128, i),
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_256 *n256 = (rt_node_256 *) node;
+
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_256_is_chunk_used(n256, i))
+ continue;
+
+ rt_print_slot(buf, i, n256->slots[i], i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) n256->slots[i], level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ StringInfoData buf;
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, &buf, false);
+
+ if (IS_LEAF_NODE(node))
+ {
+ Datum *dummy;
+
+ /* We reached a leaf node, find the corresponding slot */
+ rt_node_search(node, &dummy, key, RT_ACTION_FIND);
+
+ break;
+ }
+
+ if (!rt_node_find_child(node, &child, key))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+
+ elog(NOTICE, "\n%s", buf.data);
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ StringInfoData buf;
+
+ initStringInfo(&buf);
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu", tree->max_val);
+ rt_dump_node(tree->root, 0, &buf, true);
+ elog(NOTICE, "\n%s", buf.data);
+ elog(NOTICE, "-----------------------------------------------------------");
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..7efd4bb735
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+/* #define RT_DEBUG 1 */
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern bool rt_search(radix_tree *tree, uint64 key, Datum *val_p);
+extern void rt_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+extern void rt_free(radix_tree *tree);
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, Datum *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9090226daa..51b2514faf 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -24,6 +24,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..384b1fc41d
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,503 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int rt_node_max_entries[] = {
+ 4, /* RT_NODE_KIND_4 */
+ 16, /* RT_NODE_KIND_16 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ Datum dummy;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ Datum val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (DatumGetUInt64(val) != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, DatumGetUInt64(val), key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ rt_insert(radixtree, key, Int64GetDatum(key), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(rt_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values are
+ * stored properly.
+ */
+ if (i == (rt_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : rt_node_max_entries[j - 1],
+ rt_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search
+ * entries again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ rt_insert(radixtree, x, Int64GetDatum(x), &found);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the
+ * stats from the memory context. They should be in the same ballpark,
+ * but it's hard to automate testing that, so if you're making changes to
+ * the implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ Datum v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (DatumGetUInt64(v) != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ DatumGetUInt64(v), x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ Datum val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ if (DatumGetUInt64(val) != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ Datum v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-06-16 13:56:55 +0900, Masahiko Sawada wrote:
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..bf87f932fd
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,1763 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instruction, enabling us to use 256-bit
+ * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper.
+ * Also, there is no support for path compression and lazy path expansion. The
+ * radix tree supports fixed length of the key so we don't expect the tree level
+ * wouldn't be high.

I think we're going to need path compression at some point, fwiw. I'd bet on
it being beneficial even for the tid case.

+ * The key is a 64-bit unsigned integer and the value is a Datum.
I don't think it's a good idea to define the value type to be a datum.
A datum value is convenient to represent both a pointer and a value, so
I used it to avoid defining node types for inner and leaf nodes
separately. Since a datum could be 4 bytes or 8 bytes depending on the
platform, it might not be good for some platforms. But which aspects of
using datum do you not like?
+/*
+ * As we descend a radix tree, we push the node to the stack. The stack is used
+ * at deletion.
+ */
+typedef struct radix_tree_stack_data
+{
+	radix_tree_node *node;
+	struct radix_tree_stack_data *parent;
+} radix_tree_stack_data;
+typedef radix_tree_stack_data *radix_tree_stack;

I think it's a very bad idea for traversal to need allocations. I really want
to eventually use this for shared structures (eventually with lock-free
searches at least), and needing to do allocations while traversing the tree is
a no-go for that.

Particularly given that the tree currently has a fixed depth, can't you just
allocate this on the stack once?
Yes, we can do that.
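
A minimal sketch of that, given that a 64-bit key bounds the depth to 64 / RT_NODE_SPAN levels (the struct and function names here are illustrative, not from the patch):

#define RT_MAX_LEVEL	(64 / RT_NODE_SPAN)		/* 8 levels for 64-bit keys */

typedef struct rt_node_stack
{
	int			depth;
	radix_tree_node *nodes[RT_MAX_LEVEL];
} rt_node_stack;

static inline void
rt_stack_push(rt_node_stack *stack, radix_tree_node *node)
{
	/* No allocation: the caller keeps this struct on its own C stack. */
	Assert(stack->depth < RT_MAX_LEVEL);
	stack->nodes[stack->depth++] = node;
}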
+/*
+ * Allocate a new node with the given node kind.
+ */
+static radix_tree_node *
+radix_tree_alloc_node(radix_tree *tree, radix_tree_node_kind kind)
+{
+	radix_tree_node *newnode;
+
+	newnode = (radix_tree_node *) MemoryContextAllocZero(tree->slabs[kind],
+														 radix_tree_node_info[kind].size);
+	newnode->kind = kind;
+
+	/* update the statistics */
+	tree->mem_used += GetMemoryChunkSpace(newnode);
+	tree->cnt[kind]++;
+
+	return newnode;
+}

Why are you tracking the memory usage at this level of detail? It's *much*
cheaper to track memory usage via the memory contexts? Since they're dedicated
for the radix tree, that ought to be sufficient?
Indeed. I'll use MemoryContextMemAllocated instead.
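
A sketch of what that could look like, summing over the dedicated slab contexts instead of doing per-chunk bookkeeping (RT_NODE_KIND_COUNT is an assumed name for the number of node kinds):

uint64
radix_tree_memory_usage(radix_tree *tree)
{
	Size		total = 0;

	/* Ask the per-node-kind slab contexts instead of tracking each chunk. */
	for (int kind = 0; kind < RT_NODE_KIND_COUNT; kind++)
		total += MemoryContextMemAllocated(tree->slabs[kind], true);

	return (uint64) total;
}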
+			else if (idx != n4->n.count)
+			{
+				/*
+				 * the key needs to be inserted in the middle of the
+				 * array, make space for the new key.
+				 */
+				memmove(&(n4->chunks[idx + 1]), &(n4->chunks[idx]),
+						sizeof(uint8) * (n4->n.count - idx));
+				memmove(&(n4->slots[idx + 1]), &(n4->slots[idx]),
+						sizeof(radix_tree_node *) * (n4->n.count - idx));
+			}

Maybe we could add a static inline helper for these memmoves? Both because
it's repetitive (for different node types) and because the last time I looked
gcc was generating quite bad code for this. And having to put workarounds into
multiple places is obviously worse than having to do it in one place.
Agreed, I'll update it.
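
Something along these lines could be shared by the node kinds that keep a sorted chunk array plus a parallel slot array (a sketch; the helper name is illustrative, and a caller would pass e.g. n4->chunks, (void **) n4->slots, n4->n.count and idx):

static inline void
chunk_array_make_room(uint8 *chunks, void **slots, int count, int insertpos)
{
	/* Shift the tails of both parallel arrays one position to the right. */
	memmove(&chunks[insertpos + 1], &chunks[insertpos],
			sizeof(uint8) * (count - insertpos));
	memmove(&slots[insertpos + 1], &slots[insertpos],
			sizeof(void *) * (count - insertpos));
}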
+/*
+ * Insert the key with the val.
+ *
+ * found_p is set to true if the key already present, otherwise false, if
+ * it's not NULL.
+ *
+ * XXX: do we need to support update_if_exists behavior?
+ */

Yes, I think that's needed - hence using bfm_set() instead of insert() in the
prototype.
Agreed.
+void
+radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
+{
+	int			shift;
+	bool		replaced;
+	radix_tree_node *node;
+	radix_tree_node *parent = tree->root;
+
+	/* Empty tree, create the root */
+	if (!tree->root)
+		radix_tree_new_root(tree, key, val);
+
+	/* Extend the tree if necessary */
+	if (key > tree->max_val)
+		radix_tree_extend(tree, key);

FWIW, the reason I used separate functions for these in the prototype is that
it turns out to generate a lot better code, because it allows non-inlined
function calls to be sibling calls - thereby avoiding the need for a dedicated
stack frame. That's not possible once you need a palloc or such, so splitting
off those call paths into dedicated functions is useful.
Thank you for the info. How much does using sibling call optimization
help the performance in this case? I think that these two cases are
used only a limited number of times: inserting the first key and
extending the tree.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Jul 5, 2022 at 7:00 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-06-28 15:24:11 +0900, Masahiko Sawada wrote:
In both test cases, there is not much difference between using AVX2
and SSE2. The more node types, the more time it takes for loading the
data (see sse2_4_16_32_128_256).

Yea, at some point the compiler starts using a jump table instead of branches,
and that turns out to be a good bit more expensive. And even with branches, it
obviously adds hard to predict branches. IIRC I fought a bit with the compiler
to avoid some of that cost, it's possible that got "lost" in Sawada-san's
patch.

Sawada-san, what led you to discard the 1 and 16 node types? IIRC the 1 node
one is not unimportant until we have path compression.
I wanted to start with a smaller number of node types for simplicity.
The node-16 type has been added in the v4 patch I submitted[1]. I think it's
a trade-off between better memory usage and the overhead of growing (and
shrinking) node types. I'm going to add more node types once benchmarks
show that it's beneficial.
Right now the node struct sizes are:
4 - 48 bytes
32 - 296 bytes
128 - 1304 bytes
256 - 2088 bytes

I guess radix_tree_node_128->isset is just 16 bytes compared to 1288 other
bytes, but needing that separate isset array somehow is sad :/. I wonder if a
smaller "free index" would do the trick? Point to the element + 1 where we
searched last and start a plain loop there. Particularly in an insert-only
workload that'll always work, and in other cases it'll still often work I
think.
radix_tree_node_128->isset is used to distinguish between a null pointer
in inner nodes and the value 0 in leaf nodes. So I guess we could have a flag
indicating whether a node is a leaf or an inner node, so that we can interpret
(Datum) 0 as either a null pointer or the value 0. Or, if we define different
data types for inner and leaf nodes, we probably don't need it.
One thing I was wondering about is trying to choose node types in
roughly-power-of-two struct sizes. It's pretty easy to end up with significant
fragmentation in the slabs right now when inserting as you go, because some of
the smaller node types will be freed but not enough to actually free blocks of
memory. If we instead have ~power-of-two sizes we could just use a single slab
of the max size, and carve out the smaller node types out of that largest
allocation.
Do you mean that we manage memory allocation (and freeing) for the smaller
node types ourselves?
How about using different block size for different node types?
Btw, that fragmentation is another reason why I think it's better to track
memory usage via memory contexts, rather than doing so based on
GetMemoryChunkSpace().
Agreed.
Ideally, node16 and node32 would have the same code with a different
loop count (1 or 2). More generally, there is too much duplication of
code (noted by Andres in his PoC), and there are many variable names
with the node size embedded. This is a bit tricky to make more
general, so we don't need to try it yet, but ideally we would have
something similar to:

switch (node->kind) // todo: inspect tagged pointer
{
case RADIX_TREE_NODE_KIND_4:
idx = node_search_eq(node, chunk, 4);
do_action(node, idx, 4, ...);
break;
case RADIX_TREE_NODE_KIND_32:
idx = node_search_eq(node, chunk, 32);
do_action(node, idx, 32, ...);
...
}

FWIW, that should be doable with an inline function, if you pass it the memory
to the "array" rather than the node directly. Not so sure it's a good idea to
do dispatch between node types / search methods inside the helper, as you
suggest below:

static pg_alwaysinline void
node_search_eq(radix_tree_node node, uint8 chunk, int16 node_fanout)
{
if (node_fanout <= SIMPLE_LOOP_THRESHOLD)
// do simple loop with (node_simple *) node;
else if (node_fanout <= VECTORIZED_LOOP_THRESHOLD)
// do vectorized loop where available with (node_vec *) node;
...
}
Yeah, it's worth trying at some point.
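
For the record, a sketch of the kind of helper being discussed, taking the chunk array plus a per-call-site constant count so the compiler can specialize the loop (simple-loop variant only; the vectorized path is omitted, and the name is illustrative):

static inline int
node_chunk_array_search_eq(const uint8 *chunks, int count, uint8 chunk)
{
	for (int i = 0; i < count; i++)
	{
		if (chunks[i] == chunk)
			return i;
	}

	return -1;			/* not found */
}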
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
On 2022-07-05 16:33:17 +0900, Masahiko Sawada wrote:
On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres@anarazel.de> wrote:
A datum value is convenient to represent both a pointer and a value so
I used it to avoid defining node types for inner and leaf nodes
separately.
I'm not convinced that's a good goal. I think we're going to want to have
different key and value types, and trying to unify leaf and inner nodes is
going to make that impossible.
Consider e.g. using it for something like a buffer mapping table - your key
might be way too wide to fit it sensibly into 64bit.
Since a datum could be 4 bytes or 8 bytes depending it might not be good for
some platforms.
Right - thats another good reason why it's problematic. A lot of key types
aren't going to be 4/8 bytes dependent on 32/64bit, but either / or.
+void +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p) +{ + int shift; + bool replaced; + radix_tree_node *node; + radix_tree_node *parent = tree->root; + + /* Empty tree, create the root */ + if (!tree->root) + radix_tree_new_root(tree, key, val); + + /* Extend the tree if necessary */ + if (key > tree->max_val) + radix_tree_extend(tree, key);FWIW, the reason I used separate functions for these in the prototype is that
it turns out to generate a lot better code, because it allows non-inlined
function calls to be sibling calls - thereby avoiding the need for a dedicated
stack frame. That's not possible once you need a palloc or such, so splitting
off those call paths into dedicated functions is useful.Thank you for the info. How much does using sibling call optimization
help the performance in this case? I think that these two cases are
used only a limited number of times: inserting the first key and
extending the tree.
It's not that it helps in the cases moved into separate functions - it's that
not having that code in the "normal" paths keeps the normal path faster.
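
Roughly, the shape being described looks like this (a sketch only; pg_noinline and unlikely() are existing PostgreSQL macros, the helper names are illustrative and their bodies are omitted):

static pg_noinline void rt_new_root(radix_tree *tree, uint64 key, Datum val);
static pg_noinline void rt_extend_tree(radix_tree *tree, uint64 key);

void
rt_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p)
{
	/* Rare cases go through non-inlined helpers, keeping this frame small. */
	if (unlikely(tree->root == NULL))
		rt_new_root(tree, key, val);
	else if (unlikely(key > tree->max_val))
		rt_extend_tree(tree, key);

	/* ... the common path descends the tree without allocating ... */
}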
Greetings,
Andres Freund
Hi,
On 2022-07-05 16:33:29 +0900, Masahiko Sawada wrote:
One thing I was wondering about is trying to choose node types in
roughly-power-of-two struct sizes. It's pretty easy to end up with significant
fragmentation in the slabs right now when inserting as you go, because some of
the smaller node types will be freed but not enough to actually free blocks of
memory. If we instead have ~power-of-two sizes we could just use a single slab
of the max size, and carve out the smaller node types out of that largest
allocation.You meant to manage memory allocation (and free) for smaller node
types by ourselves?
For all of them basically. Using a single slab allocator and then subdividing
the "common block size" into however many chunks that fit into a single node
type.
How about using different block size for different node types?
Not following...
Greetings,
Andres Freund
On Mon, Jul 4, 2022 at 12:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Looking at the node stats, and then your benchmark code, I think key
construction is a major influence, maybe more than node type. The
key/value scheme tested now makes sense:

blockhi || blocklo || 9 bits of item offset

(with the leaf nodes containing a bit map of the lowest few bits of
this whole thing)

We want the lower fanout nodes at the top of the tree and higher
fanout ones at the bottom.

So more inner nodes can fit in CPU cache, right?
My thinking is, on average, there will be more dense space utilization
in the leaf bitmaps, and fewer inner nodes. I'm not quite sure about
cache, since with my idea a search might have to visit more nodes to
get the common negative result (indexed tid not found in vacuum's
list).
Note some consequences: If the table has enough columns such that much
fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
dense case the nodes above the leaves will have lower fanout (maybe
they will fit in a node32). Also, the bitmap values in the leaves will
be more empty. In other words, many tables in the wild *resemble* the
sparse case a bit, even if truly all tuples on the page are dead.

Note also that the dense case in the benchmark above has ~4500 times
more keys than the sparse case, and uses about ~1000 times more
memory. But the runtime is only 2-3 times longer. That's interesting
to me.

To optimize for the sparse case, it seems to me that the key/value would be
blockhi || 9 bits of item offset || blocklo
I believe that would make the leaf nodes more dense, with fewer inner
nodes, and could drastically speed up the sparse case, and maybe many
realistic dense cases.

Does it have an effect on the number of inner nodes?
I'm curious to hear your thoughts.
Thank you for your analysis. It's worth trying. We use 9 bits for item
offset but most pages don't use all bits in practice. So probably it
might be better to move the most significant bit of item offset to the
left of blockhi. Or more simply:

9 bits of item offset || blockhi || blocklo
A concern here is most tids won't use many bits in blockhi either,
most often far fewer, so this would make the tree higher, I think.
Each value of blockhi represents 0.5GB of heap (32TB max). Even with
very large tables I'm guessing most pages of interest to vacuum are
concentrated in a few of these 0.5GB "segments".
And it's possible path compression would change the tradeoffs here.
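
For reference, a minimal sketch of the key construction currently tested (block number in the high bits, the 9-bit item offset in the low bits); the macro and function names are illustrative:

#define TID_OFFSET_BITS		9	/* MaxHeapTuplesPerPage fits in 9 bits */

static inline uint64
vac_tid_to_key(ItemPointer tid)
{
	uint64		block = ItemPointerGetBlockNumber(tid);
	uint64		offset = ItemPointerGetOffsetNumber(tid);

	/* "blockhi || blocklo || 9 bits of item offset" */
	return (block << TID_OFFSET_BITS) | offset;
}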
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Jul 5, 2022 at 5:09 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-07-05 16:33:17 +0900, Masahiko Sawada wrote:
On Tue, Jul 5, 2022 at 6:18 AM Andres Freund <andres@anarazel.de> wrote:
A datum value is convenient to represent both a pointer and a value so
I used it to avoid defining node types for inner and leaf nodes
separately.I'm not convinced that's a good goal. I think we're going to want to have
different key and value types, and trying to unify leaf and inner nodes is
going to make that impossible.Consider e.g. using it for something like a buffer mapping table - your key
might be way too wide to fit it sensibly into 64bit.
Right. It seems better to have an interface that lets the user of
the radix tree specify an arbitrary key size (and perhaps the value
size too?) at creation time. And we can have separate leaf node types that
hold values instead of pointers. If the value size is no larger than the
pointer size, we can store values within leaf nodes, but if it's bigger,
the leaf node can probably hold pointers to memory where the value is
stored.
Since a datum could be 4 bytes or 8 bytes depending it might not be good for
some platforms.Right - thats another good reason why it's problematic. A lot of key types
aren't going to be 4/8 bytes dependent on 32/64bit, but either / or.+void +radix_tree_insert(radix_tree *tree, uint64 key, Datum val, bool *found_p) +{ + int shift; + bool replaced; + radix_tree_node *node; + radix_tree_node *parent = tree->root; + + /* Empty tree, create the root */ + if (!tree->root) + radix_tree_new_root(tree, key, val); + + /* Extend the tree if necessary */ + if (key > tree->max_val) + radix_tree_extend(tree, key);FWIW, the reason I used separate functions for these in the prototype is that
it turns out to generate a lot better code, because it allows non-inlined
function calls to be sibling calls - thereby avoiding the need for a dedicated
stack frame. That's not possible once you need a palloc or such, so splitting
off those call paths into dedicated functions is useful.Thank you for the info. How much does using sibling call optimization
help the performance in this case? I think that these two cases are
used only a limited number of times: inserting the first key and
extending the tree.It's not that it helps in the cases moved into separate functions - it's that
not having that code in the "normal" paths keeps the normal path faster.
Thanks, understood.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Jul 5, 2022 at 5:49 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Mon, Jul 4, 2022 at 12:07 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Looking at the node stats, and then your benchmark code, I think key
construction is a major influence, maybe more than node type. The
key/value scheme tested now makes sense:blockhi || blocklo || 9 bits of item offset
(with the leaf nodes containing a bit map of the lowest few bits of
this whole thing)We want the lower fanout nodes at the top of the tree and higher
fanout ones at the bottom.So more inner nodes can fit in CPU cache, right?
My thinking is, on average, there will be more dense space utilization
in the leaf bitmaps, and fewer inner nodes. I'm not quite sure about
cache, since with my idea a search might have to visit more nodes to
get the common negative result (indexed tid not found in vacuum's
list).Note some consequences: If the table has enough columns such that much
fewer than 100 tuples fit on a page (maybe 30 or 40), then in the
dense case the nodes above the leaves will have lower fanout (maybe
they will fit in a node32). Also, the bitmap values in the leaves will
be more empty. In other words, many tables in the wild *resemble* the
sparse case a bit, even if truly all tuples on the page are dead.Note also that the dense case in the benchmark above has ~4500 times
more keys than the sparse case, and uses about ~1000 times more
memory. But the runtime is only 2-3 times longer. That's interesting
to me.To optimize for the sparse case, it seems to me that the key/value would be
blockhi || 9 bits of item offset || blocklo
I believe that would make the leaf nodes more dense, with fewer inner
nodes, and could drastically speed up the sparse case, and maybe many
realistic dense cases.Does it have an effect on the number of inner nodes?
I'm curious to hear your thoughts.
Thank you for your analysis. It's worth trying. We use 9 bits for item
offset but most pages don't use all bits in practice. So probably it
might be better to move the most significant bit of item offset to the
left of blockhi. Or more simply:9 bits of item offset || blockhi || blocklo
A concern here is most tids won't use many bits in blockhi either,
most often far fewer, so this would make the tree higher, I think.
Each value of blockhi represents 0.5GB of heap (32TB max). Even with
very large tables I'm guessing most pages of interest to vacuum are
concentrated in a few of these 0.5GB "segments".
Right.
I guess that the tree height is affected by where the garbage is, right?
For example, even if all garbage in the table is concentrated within
0.5GB, if it lies between block 2^17 and block 2^18 we use the first
byte of blockhi. If the table is larger than 128GB, the second byte of
blockhi could be used, depending on where the garbage is.
Another variation of how to store TIDs would be to use the block
number as the key and store a bitmap of the offsets as the value. We could
use Bitmapset, for example, or an approach like Roaring bitmap.
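
A sketch of that variation, using the rt_* interface from the earlier patch, with one entry per heap block whose value is a pointer to a Bitmapset of dead offsets (the helper name is illustrative, and it assumes rt_insert overwrites the value of an existing key):

static void
record_dead_tid(radix_tree *tree, ItemPointer tid)
{
	uint64		key = (uint64) ItemPointerGetBlockNumber(tid);
	Datum		val;
	Bitmapset  *offsets = NULL;
	bool		found;

	/* Fetch the existing offset bitmap for this block, if any. */
	if (rt_search(tree, key, &val))
		offsets = (Bitmapset *) DatumGetPointer(val);

	offsets = bms_add_member(offsets, ItemPointerGetOffsetNumber(tid));
	rt_insert(tree, key, PointerGetDatum(offsets), &found);
}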
I think that at this stage it's better to define the design first. For
example, the key size and value size: are these sizes fixed, or can they be
set to an arbitrary size? Given the buffer mapping use case, we would
need a wider key to store RelFileNode, ForkNumber, and BlockNumber. On
the other hand, limiting the key size to a 64-bit integer makes the
logic simple, and possibly it could still be used in buffer mapping
cases by using a tree of trees. For the value size, if we support
different value sizes specified by the user, we can either embed
multiple values in the leaf node (called Multi-value leaves in the ART
paper) or introduce a leaf node that stores one value (called
Single-value leaves).
And it's possible path compression would change the tradeoffs here.
Agreed.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Fri, Jul 8, 2022 at 9:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I guess that the tree height is affected by where garbages are, right?
For example, even if all garbage in the table is concentrated in
0.5GB, if they exist between 2^17 and 2^18 block, we use the first
byte of blockhi. If the table is larger than 128GB, the second byte of
the blockhi could be used depending on where the garbage exists.
Right.
Another variation of how to store TID would be that we use the block
number as a key and store a bitmap of the offset as a value. We can
use Bitmapset for example,
I like the idea of using existing code to set/check a bitmap if it's
convenient. But (in case that was implied here) I'd really like to
stay away from variable-length values, which would require
"Single-value leaves" (slow). I also think it's fine to treat the
key/value as just bits, and not care where exactly they came from, as
we've been talking about.
or an approach like Roaring bitmap.
This would require two new data structures instead of one. That
doesn't seem like a path to success.
I think that at this stage it's better to define the design first. For
example, key size and value size, and these sizes are fixed or can be
set the arbitary size?
I don't think we need to start over. Andres' prototype had certain
design decisions built in for the intended use case (although maybe
not clearly documented as such). Subsequent patches in this thread
substantially changed many design aspects. If there were any changes
that made things wonderful for vacuum, it wasn't explained, but Andres
did explain how some of these changes were not good for other uses.
Going to fixed 64-bit keys and values should still allow many future
applications, so let's do that if there's no reason not to.
For value size, if we support
different value sizes specified by the user, we can either embed
multiple values in the leaf node (called Multi-value leaves in ART
paper)
I don't think "Multi-value leaves" allow for variable-length values,
FWIW. And now I see I also used this term wrong in my earlier review
comment -- v3/4 don't actually use "multi-value leaves", but Andres'
does (going by the multiple leaf types). From the paper: "Multi-value
leaves: The values are stored in one of four different leaf node
types, which mirror the structure of inner nodes, but contain values
instead of pointers."
(It seems v3/v4 could be called a variation of "Combined pointer/value
slots: If values fit into pointers, no separate node types are
necessary. Instead, each pointer storage location in an inner node can
either store a pointer or a value." But without the advantage of
variable length keys).
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Jul 8, 2022 at 3:43 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Fri, Jul 8, 2022 at 9:10 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I guess that the tree height is affected by where garbages are, right?
For example, even if all garbage in the table is concentrated in
0.5GB, if they exist between 2^17 and 2^18 block, we use the first
byte of blockhi. If the table is larger than 128GB, the second byte of
the blockhi could be used depending on where the garbage exists.Right.
Another variation of how to store TID would be that we use the block
number as a key and store a bitmap of the offset as a value. We can
use Bitmapset for example,I like the idea of using existing code to set/check a bitmap if it's
convenient. But (in case that was implied here) I'd really like to
stay away from variable-length values, which would require
"Single-value leaves" (slow). I also think it's fine to treat the
key/value as just bits, and not care where exactly they came from, as
we've been talking about.or an approach like Roaring bitmap.
This would require two new data structures instead of one. That
doesn't seem like a path to success.
Agreed.
I think that at this stage it's better to define the design first. For
example, key size and value size, and these sizes are fixed or can be
set the arbitary size?I don't think we need to start over. Andres' prototype had certain
design decisions built in for the intended use case (although maybe
not clearly documented as such). Subsequent patches in this thread
substantially changed many design aspects. If there were any changes
that made things wonderful for vacuum, it wasn't explained, but Andres
did explain how some of these changes were not good for other uses.
Going to fixed 64-bit keys and values should still allow many future
applications, so let's do that if there's no reason not to.
I thought Andres pointed out that, given that we store a BufferTag (or
part of it) in the key, fixed 64-bit keys might not be enough
for buffer mapping use cases. If we want to support keys wider than
64 bits, we would need to consider that.
For value size, if we support
different value sizes specified by the user, we can either embed
multiple values in the leaf node (called Multi-value leaves in ART
paper)I don't think "Multi-value leaves" allow for variable-length values,
FWIW. And now I see I also used this term wrong in my earlier review
comment -- v3/4 don't actually use "multi-value leaves", but Andres'
does (going by the multiple leaf types). From the paper: "Multi-value
leaves: The values are stored in one of four different leaf node
types, which mirror the structure of inner nodes, but contain values
instead of pointers."
Right, but sorry, I meant that the user specifies an arbitrary fixed value
size at creation time, as we do in dynahash.c.
(It seems v3/v4 could be called a variation of "Combined pointer/value
slots: If values fit into pointers, no separate node types are
necessary. Instead, each pointer storage location in an inner node can
either store a pointer or a value." But without the advantage of
variable length keys).
Agreed.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Tue, Jul 12, 2022 at 8:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I think that at this stage it's better to define the design first. For
example, key size and value size, and these sizes are fixed or can be
set the arbitary size?I don't think we need to start over. Andres' prototype had certain
design decisions built in for the intended use case (although maybe
not clearly documented as such). Subsequent patches in this thread
substantially changed many design aspects. If there were any changes
that made things wonderful for vacuum, it wasn't explained, but Andres
did explain how some of these changes were not good for other uses.
Going to fixed 64-bit keys and values should still allow many future
applications, so let's do that if there's no reason not to.I thought Andres pointed out that given that we store BufferTag (or
part of that) into the key, the fixed 64-bit keys might not be enough
for buffer mapping use cases. If we want to use wider keys more than
64-bit, we would need to consider it.
It sounds like you've answered your own question, then. If so, I'm
curious what your current thinking is.
If we *did* want to have maximum flexibility, then "single-value
leaves" method would be the way to go, since it seems to be the
easiest way to have variable-length both keys and values. I do have a
concern that the extra pointer traversal would be a drag on
performance, and also require lots of small memory allocations. If we
happened to go that route, your idea upthread of using a bitmapset of
item offsets in the leaves sounds like a good fit for that.
I also have some concerns about also simultaneously trying to design
for the use for buffer mappings. I certainly want to make this good
for as many future uses as possible, and I'd really like to preserve
any optimizations already fought for. However, to make concrete
progress on the thread subject, I also don't think it's the most
productive use of time to get tied up about the fine details of
something that will not likely happen for several years at the
earliest.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Jul 14, 2022 at 1:17 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jul 12, 2022 at 8:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I think that at this stage it's better to define the design first. For
example, key size and value size, and these sizes are fixed or can be
set the arbitary size?I don't think we need to start over. Andres' prototype had certain
design decisions built in for the intended use case (although maybe
not clearly documented as such). Subsequent patches in this thread
substantially changed many design aspects. If there were any changes
that made things wonderful for vacuum, it wasn't explained, but Andres
did explain how some of these changes were not good for other uses.
Going to fixed 64-bit keys and values should still allow many future
applications, so let's do that if there's no reason not to.I thought Andres pointed out that given that we store BufferTag (or
part of that) into the key, the fixed 64-bit keys might not be enough
for buffer mapping use cases. If we want to use wider keys more than
64-bit, we would need to consider it.It sounds like you've answered your own question, then. If so, I'm
curious what your current thinking is.If we *did* want to have maximum flexibility, then "single-value
leaves" method would be the way to go, since it seems to be the
easiest way to have variable-length both keys and values. I do have a
concern that the extra pointer traversal would be a drag on
performance, and also require lots of small memory allocations.
Agreed.
I also have some concerns about also simultaneously trying to design
for the use for buffer mappings. I certainly want to make this good
for as many future uses as possible, and I'd really like to preserve
any optimizations already fought for. However, to make concrete
progress on the thread subject, I also don't think it's the most
productive use of time to get tied up about the fine details of
something that will not likely happen for several years at the
earliest.
I'd like to keep the first version simple. We can improve it and add
more optimizations later. Using a radix tree for vacuum TID storage
would still be a big win compared to using a flat array, even without
all these optimizations. As for the single-value leaves method, I'm
also concerned about the extra pointer traversal and extra memory
allocation. It's the most flexible, but the multi-value leaves method is also
flexible enough for many use cases. Using the single-value method
seems to be too much as the first step for me.

Overall, using 64-bit keys and 64-bit values would be a reasonable
choice for me as the first step. It can cover wider use cases,
including the vacuum TID use case. And possibly it can cover other use cases
by combining it with a hash table or using a tree of trees, for example.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Hi,
On 2022-07-08 11:09:44 +0900, Masahiko Sawada wrote:
I think that at this stage it's better to define the design first. For
example, key size and value size, and these sizes are fixed or can be
set the arbitary size? Given the use case of buffer mapping, we would
need a wider key to store RelFileNode, ForkNumber, and BlockNumber. On
the other hand, limiting the key size is 64 bit integer makes the
logic simple, and possibly it could still be used in buffer mapping
cases by using a tree of a tree. For value size, if we support
different value sizes specified by the user, we can either embed
multiple values in the leaf node (called Multi-value leaves in ART
paper) or introduce a leaf node that stores one value (called
Single-value leaves).
FWIW, I think the best path forward would be to do something similar to the
simplehash.h approach, so it can be customized to the specific user.
Greetings,
Andres Freund
On Tue, Jul 19, 2022 at 9:24 AM Andres Freund <andres@anarazel.de> wrote:
FWIW, I think the best path forward would be to do something similar to
the
simplehash.h approach, so it can be customized to the specific user.
I figured that would come up at some point. It may be worth doing in the
future, but I think it's way too much to ask for the first use case.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Jul 18, 2022 at 9:10 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jul 19, 2022 at 9:24 AM Andres Freund <andres@anarazel.de> wrote:
FWIW, I think the best path forward would be to do something similar to the
simplehash.h approach, so it can be customized to the specific user.

I figured that would come up at some point. It may be worth doing in the future, but I think it's way too much to ask for the first use case.
I have a prototype patch that creates a read-only snapshot of the
visibility map, and has vacuumlazy.c work off of that when determining
which pages to skip. The patch also gets rid of the
SKIP_PAGES_THRESHOLD stuff. This is very effective with TPC-C,
principally because it really cuts down on the number of scanned_pages
that are scanned only because the VM bit is unset concurrently by DML.
The window for this is very large when the table is large (and
naturally takes a long time to scan), resulting in many more "dead but
not yet removable" tuples being encountered than necessary. Which
itself causes bogus information in the FSM -- information about the
space that VACUUM could free from the page, which is often highly
misleading.
There are remaining questions about how to do this properly. Right now
I'm just copying pages from the VM into local memory, right after
OldestXmin is first acquired -- we "lock in" a snapshot of the VM at
the earliest opportunity, which is what lazy_scan_skip() actually
works off now. There needs to be some consideration given to the
resource management aspects of this -- it needs to use memory
sensibly, which the current prototype patch doesn't do at all. I'm
probably going to seriously pursue this as a project soon, and will
probably need some kind of data structure for the local copy. The raw
pages are usually quite space inefficient, considering we only need an
immutable snapshot of the VM.
I wonder if it makes sense to use this as part of this project. It
will be possible to know the exact heap pages that will become
scanned_pages before scanning even one page with this design (perhaps
with caveats about low memory conditions). It could also be very
effective as a way of speeding up TID lookups in the reasonably common
case where most scanned_pages don't have any LP_DEAD items -- just
look it up in our local/materialized copy of the VM first. But even
when LP_DEAD items are spread fairly evenly, it could still give us
reliable information about the distribution of LP_DEAD items very
early on.
Maybe the two data structures could even be combined in some way? You
can use more memory for the local copy of the VM if you know that you
won't need the memory for dead_items. It's kinda the same problem, in
a way.
--
Peter Geoghegan
On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I’d like to keep the first version simple. We can improve it and add
more optimizations later. Using radix tree for vacuum TID storage
would still be a big win comparing to using a flat array, even without
all these optimizations. In terms of single-value leaves method, I'm
also concerned about an extra pointer traversal and extra memory
allocation. It's most flexible but multi-value leaves method is also
flexible enough for many use cases. Using the single-value method
seems to be too much as the first step for me.Overall, using 64-bit keys and 64-bit values would be a reasonable
choice for me as the first step . It can cover wider use cases
including vacuum TID use cases. And possibly it can cover use cases by
combining a hash table or using tree of tree, for example.
These two aspects would also bring it closer to Andres' prototype, which 1)
makes review easier and 2) makes it easier to preserve optimization work
already done, so +1 from me.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Jul 19, 2022 at 1:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I’d like to keep the first version simple. We can improve it and add
more optimizations later. Using radix tree for vacuum TID storage
would still be a big win comparing to using a flat array, even without
all these optimizations. In terms of single-value leaves method, I'm
also concerned about an extra pointer traversal and extra memory
allocation. It's most flexible but multi-value leaves method is also
flexible enough for many use cases. Using the single-value method
seems to be too much as the first step for me.

Overall, using 64-bit keys and 64-bit values would be a reasonable
choice for me as the first step. It can cover wider use cases
including vacuum TID use cases. And possibly it can cover use cases by
combining a hash table or using a tree of trees, for example.

These two aspects would also bring it closer to Andres' prototype, which 1)
makes review easier and 2) makes it easier to preserve the optimization work
already done, so +1 from me.
Thanks.
I've updated the patch. It now implements 64-bit keys, 64-bit values,
and the multi-value leaves method. I've tried to remove duplicated
code but we might find a better way to do that.
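For reference, a small usage sketch against the interface declared in
radixtree.h from the attached patch. The TID-to-key encoding below is purely
illustrative (one possible way to pack a block and offset number into a
64-bit key), not something the patch itself prescribes:

#include "postgres.h"

#include "lib/radixtree.h"
#include "storage/itemptr.h"

/*
 * Illustrative only: pack a TID into a uint64 key by combining the block
 * number and the offset number.  Whether vacuum would use exactly this
 * layout is an open question.
 */
static inline uint64
tid_to_key(ItemPointer tid)
{
	return ((uint64) ItemPointerGetBlockNumber(tid) << 16) |
		ItemPointerGetOffsetNumber(tid);
}

void
example_dead_tid_set(void)
{
	radix_tree *tree = rt_create(CurrentMemoryContext);
	ItemPointerData tid;
	uint64		value;

	ItemPointerSet(&tid, 10, 3);

	/* remember this TID; the value is unused here, so store 0 */
	rt_set(tree, tid_to_key(&tid), 0);

	/* later: membership test, as lazy_tid_reaped() would do */
	if (rt_search(tree, tid_to_key(&tid), &value))
	{
		/* found */
	}

	rt_free(tree);
}

A vacuum integration would mostly care about the membership test, i.e. the
rt_search() call at the end.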
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Attachments:
radixtree_v5.patch (application/octet-stream)
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..ead0755d25 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,9 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
+radixtree.o: CFLAGS+=-msse2
+
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..1aececbf46
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2336 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instructions, enabling us to use 256-bit
+ * width SIMD vector, whereas 128-bit width SIMD vector is used in the paper.
+ * Also, there is no support for path compression and lazy path expansion. The
+ * radix tree supports only fixed-length keys, so we don't expect the tree
+ * height to be high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The internal nodes
+ * and the leaf nodes have slightly different structures: internal tree nodes,
+ * with shift > 0, store the pointer to their child node as the value, while
+ * leaf nodes, with shift == 0, store the 64-bit unsigned integer specified by
+ * the user as the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iterate - End iteration
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * and creates memory contexts for each kind of radix tree node under it.
+ *
+ * rt_iterate_next() returns key-value pairs in the ascending order of
+ * the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+#if defined(__SSE2__)
+#include <emmintrin.h> /* SSE2 intrinsics */
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes required for a bitmap covering nslots slots.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Maximum number of levels the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-128 */
+#define RT_NODE_128_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) \
+ ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from the chunk or slot number to the byte and bit in the is-set
+ * bitmap in node-128 and node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree nodes.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation
+ * a smaller class should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 48 -> 152 -> 296 -> 1304 -> 2088 bytes for inner/leaf nodes, leading to
+ * large amounts of allocator padding with aset.c. Hence the use of slab.
+ *
+ * XXX: do we need a node-1 as long as there is no path compression optimization?
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+typedef enum rt_node_kind
+{
+ RT_NODE_KIND_4 = 0,
+ RT_NODE_KIND_16,
+ RT_NODE_KIND_32,
+ RT_NODE_KIND_128,
+ RT_NODE_KIND_256
+} rt_node_kind;
+#define RT_NODE_KIND_COUNT (RT_NODE_KIND_256 + 1)
+
+/*
+ * Base type for all nodes types.
+ */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size class of the node */
+ rt_node_kind kind;
+} rt_node;
+
+/* Macros for radix tree nodes */
+#define IS_LEAF_NODE(n) (((rt_node *) (n))->shift == 0)
+#define IS_EMPTY_NODE(n) (((rt_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((rt_node *) (n))->count < rt_node_info[((rt_node *) (n))->kind].fanout)
+
+/* Base types for inner and leaf nodes of each node type */
+typedef struct rd_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rd_node_base_16
+{
+ rt_node n;
+
+ /* 16 children, for key chunks */
+ uint8 chunks[16];
+} rt_node_base_16;
+
+typedef struct rd_node_base_32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+typedef struct rd_node_base_128
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+} rt_node_base_128;
+
+typedef struct rd_node_base_256
+{
+ rt_node n;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * Leaf nodes are separate from the inner node size classes for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* 4 children, for key chunks */
+ rt_node *children[4];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* 4 values, for key chunks */
+ uint64 values[4];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_16
+{
+ rt_node_base_16 base;
+
+ /* 16 children, for key chunks */
+ rt_node *children[16];
+} rt_node_inner_16;
+
+typedef struct rt_node_leaf_16
+{
+ rt_node_base_16 base;
+
+ /* 16 values, for key chunks */
+ uint64 values[16];
+} rt_node_leaf_16;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* 32 children, for key chunks */
+ rt_node *children[32];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* 32 values, for key chunks */
+ uint64 values[32];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 children */
+ rt_node *children[128];
+} rt_node_inner_128;
+
+typedef struct rt_node_leaf_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 values */
+ uint64 values[128];
+} rt_node_leaf_128;
+
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information of each size class */
+typedef struct rt_node_info_elem
+{
+ const char *name;
+ int fanout;
+ Size inner_size;
+ Size leaf_size;
+} rt_node_info_elem;
+
+static rt_node_info_elem rt_node_info[RT_NODE_KIND_COUNT] = {
+
+ [RT_NODE_KIND_4] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4),
+ .leaf_size = sizeof(rt_node_leaf_4),
+ },
+ [RT_NODE_KIND_16] = {
+ .name = "radix tree node 16",
+ .fanout = 16,
+ .inner_size = sizeof(rt_node_inner_16),
+ .leaf_size = sizeof(rt_node_leaf_16),
+ },
+ [RT_NODE_KIND_32] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32),
+ .leaf_size = sizeof(rt_node_leaf_32),
+ },
+ [RT_NODE_KIND_128] = {
+ .name = "radix tree node 128",
+ .fanout = 128,
+ .inner_size = sizeof(rt_node_inner_128),
+ .leaf_size = sizeof(rt_node_leaf_128),
+ },
+ [RT_NODE_KIND_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ },
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ * rt_iter_node_data struct is used to track the iteration within a node.
+ * rt_iter has the array of this struct, stack, in order to track the iteration
+ * of every level. During the iteration, we also construct the key to return
+ * whenever we update the node iteration information, e.g., when advancing the
+ * current index within the node or when moving to the next node at the same level.
+ */
+typedef struct rt_iter_node_data
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_iter_node_data;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_iter_node_data stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+ int32 cnt[RT_NODE_KIND_COUNT];
+};
+
+static rt_node *rt_node_grow(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_node_kind kind, bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_copy_node_common(rt_node *src, rt_node *dst);
+static void rt_extend(radix_tree *tree, uint64 key);
+static void rt_new_root(radix_tree *tree, uint64 key);
+
+/* search */
+static bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_search(rt_node *node, uint64 key, rt_action action, void **slot_p);
+
+/* insertion */
+static rt_node *rt_node_add_new_child(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key);
+static int rt_node_prepare_insert(radix_tree *tree, rt_node *parent,
+ rt_node **node_p, uint64 key,
+ bool *will_replace_p);
+static void rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child, bool *replaced_p);
+static void rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value, bool *replaced_p);
+
+/* iteration */
+static pg_attribute_always_inline void rt_iter_update_key(rt_iter *iter, uint8 chunk,
+ uint8 shift);
+static void *rt_node_iterate_next(rt_iter *iter, rt_iter_node_data *node_iter,
+ bool *found_p);
+static void rt_store_iter_node(rt_iter *iter, rt_iter_node_data *node_iter,
+ rt_node *node);
+static void rt_update_iter_stack(rt_iter *iter, int from);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * The fanout thresholds used to choose how to search for the key in the chunk
+ * array.
+ *
+ * On platforms where vector instructions are not available, we use the simple
+ * for-loop approach for all cases.
+ */
+#define RT_SIMPLE_LOOP_THRESHOLD 4 /* use simple for-loop */
+#define RT_VECRTORIZED_LOOP_THRESHOLD 32 /* use SIMD instructions */
+
+static pg_attribute_always_inline int
+search_chunk_array_eq(uint8 *chunks, uint8 key, uint8 node_fanout, uint8 node_count)
+{
+ if (node_fanout <= RT_SIMPLE_LOOP_THRESHOLD)
+ {
+ for (int i = 0; i < node_count; i++)
+ {
+ if (chunks[i] > key)
+ return -1;
+
+ if (chunks[i] == key)
+ return i;
+ }
+
+ return -1;
+ }
+ else if (node_fanout <= RT_VECRTORIZED_LOOP_THRESHOLD)
+ {
+ /*
+ * On Windows, even if we use SSE intrinsics, pg_rightmost_one_pos32
+ * is slow. So we guard with HAVE__BUILTIN_CTZ as well.
+ *
+ * XXX: once we have the correct interfaces to pg_bitutils.h for
+ * Windows we can remove the HAVE__BUILTIN_CTZ condition.
+ */
+#if defined(__SSE2__) && defined(HAVE__BUILTIN_CTZ)
+ int index = 0;
+ __m128i key_v = _mm_set1_epi8(key);
+
+ while (index < node_count)
+ {
+ __m128i data_v = _mm_loadu_si128((__m128i_u *) & (chunks[index]));
+ __m128i cmp_v = _mm_cmpeq_epi8(key_v, data_v);
+ uint32 bitfield = _mm_movemask_epi8(cmp_v);
+
+ bitfield &= ((UINT64CONST(1) << node_count) - 1);
+
+ if (bitfield)
+ {
+ index += pg_rightmost_one_pos32(bitfield);
+ break;
+ }
+
+ index += 16;
+ }
+
+ return (index < node_count) ? index : -1;
+#else
+ for (int i = 0; i < node_count; i++)
+ {
+ if (chunks[i] > key)
+ return -1;
+
+ if (chunks[i] == key)
+ return i;
+ }
+
+ return -1;
+#endif
+ }
+ else
+ elog(ERROR, "unsupported fanout size %u for chunk array search",
+ node_fanout);
+}
+
+/*
+ * This is a bit more complicated than search_chunk_array_eq(), because
+ * until recently no unsigned uint8 comparison instruction existed on x86. So
+ * we need to play some trickery using _mm_min_epu8() to effectively get
+ * <=. There never will be any equal elements in the current uses, but that's
+ * what we get here...
+ */
+static pg_attribute_always_inline int
+search_chunk_array_le(uint8 *chunks, uint8 key, uint8 node_fanout, uint8 node_count)
+{
+ if (node_fanout <= RT_SIMPLE_LOOP_THRESHOLD)
+ {
+ int index;
+
+ for (index = 0; index < node_count; index++)
+ {
+ if (chunks[index] >= key)
+ break;
+ }
+
+ return index;
+ }
+ else if (node_fanout <= RT_VECRTORIZED_LOOP_THRESHOLD)
+ {
+#if defined(__SSE2__) && defined(HAVE__BUILTIN_CTZ)
+ int index = 0;
+ bool found = false;
+ __m128i key_v = _mm_set1_epi8(key);
+
+ while (index < node_count)
+ {
+ __m128i data_v = _mm_loadu_si128((__m128i_u *) & (chunks[index]));
+ __m128i min_v = _mm_min_epu8(data_v, key_v);
+ __m128i cmp_v = _mm_cmpeq_epi8(key_v, min_v);
+ uint32 bitfield = _mm_movemask_epi8(cmp_v);
+
+ bitfield &= ((UINT64CONST(1) << node_count) - 1);
+
+ if (bitfield)
+ {
+ index += pg_rightmost_one_pos32(bitfield);
+ found = true;
+ break;
+ }
+
+ index += 16;
+ }
+
+ return found ? index : node_count;
+#else
+ int index;
+
+ for (index = 0; index < node_count; index++)
+ {
+ if (chunks[index] >= key)
+ break;
+ }
+
+ return index;
+#endif
+ }
+ else
+ elog(ERROR, "unsupported fanout size %u for chunk array search",
+ node_fanout);
+}
+
+/* Node support functions for all node types to get their children or values */
+
+/* Return the array of children in the inner node */
+static rt_node **
+rt_node_get_inner_children(rt_node *node)
+{
+ rt_node **children = NULL;
+
+ Assert(!IS_LEAF_NODE(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ children = (rt_node **) ((rt_node_inner_4 *) node)->children;
+ break;
+ case RT_NODE_KIND_16:
+ children = (rt_node **) ((rt_node_inner_16 *) node)->children;
+ break;
+ case RT_NODE_KIND_32:
+ children = (rt_node **) ((rt_node_inner_32 *) node)->children;
+ break;
+ case RT_NODE_KIND_128:
+ children = (rt_node **) ((rt_node_inner_128 *) node)->children;
+ break;
+ case RT_NODE_KIND_256:
+ children = (rt_node **) ((rt_node_inner_256 *) node)->children;
+ break;
+ default:
+ elog(ERROR, "unexpected node type %u", node->kind);
+ }
+
+ return children;
+}
+
+/* Return the array of values in the leaf node */
+static uint64 *
+rt_node_get_leaf_values(rt_node *node)
+{
+ uint64 *values = NULL;
+
+ Assert(IS_LEAF_NODE(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ values = ((rt_node_leaf_4 *) node)->values;
+ break;
+ case RT_NODE_KIND_16:
+ values = ((rt_node_leaf_16 *) node)->values;
+ break;
+ case RT_NODE_KIND_32:
+ values = ((rt_node_leaf_32 *) node)->values;
+ break;
+ case RT_NODE_KIND_128:
+ values = ((rt_node_leaf_128 *) node)->values;
+ break;
+ case RT_NODE_KIND_256:
+ values = ((rt_node_leaf_256 *) node)->values;
+ break;
+ default:
+ elog(ERROR, "unexpected node type %u", node->kind);
+ }
+
+ return values;
+}
+
+/*
+ * Node support functions for node-4, node-16, and node-32.
+ *
+ * These three node types have similar structure -- they have the array of chunks with
+ * different length and corresponding pointers or values depending on inner nodes or
+ * leaf nodes.
+ */
+#define ENSURE_CHUNK_ARRAY_NODE(node) \
+ Assert(((((rt_node*) node)->kind) == RT_NODE_KIND_4) || \
+ ((((rt_node*) node)->kind) == RT_NODE_KIND_16) || \
+ ((((rt_node*) node)->kind) == RT_NODE_KIND_32))
+
+/* Get the pointer to either the child or the value at 'idx' */
+static void *
+chunk_array_node_get_slot(rt_node *node, int idx)
+{
+ void *slot;
+
+ ENSURE_CHUNK_ARRAY_NODE(node);
+
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_leaf_values(node);
+
+ slot = (void *) &(values[idx]);
+ }
+ else
+ {
+ rt_node **children = rt_node_get_inner_children(node);
+
+ slot = (void *) children[idx];
+ }
+
+ return slot;
+}
+
+/* Return the chunk array in the node */
+static uint8 *
+chunk_array_node_get_chunks(rt_node *node)
+{
+ uint8 *chunk = NULL;
+
+ ENSURE_CHUNK_ARRAY_NODE(node);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ chunk = (uint8 *) ((rt_node_base_4 *) node)->chunks;
+ break;
+ case RT_NODE_KIND_16:
+ chunk = (uint8 *) ((rt_node_base_16 *) node)->chunks;
+ break;
+ case RT_NODE_KIND_32:
+ chunk = (uint8 *) ((rt_node_base_32 *) node)->chunks;
+ break;
+ default:
+			/* this function doesn't support node-128 and node-256 */
+ elog(ERROR, "unsupported node type %d", node->kind);
+ }
+
+ return chunk;
+}
+
+/* Copy the contents of the node from 'src' to 'dst' */
+static void
+chunk_array_node_copy_chunks_and_slots(rt_node *src, rt_node *dst)
+{
+ uint8 *chunks_src,
+ *chunks_dst;
+
+ ENSURE_CHUNK_ARRAY_NODE(src);
+ ENSURE_CHUNK_ARRAY_NODE(dst);
+
+ /* Copy base type */
+ rt_copy_node_common(src, dst);
+
+ /* Copy chunk array */
+ chunks_src = chunk_array_node_get_chunks(src);
+ chunks_dst = chunk_array_node_get_chunks(dst);
+ memcpy(chunks_dst, chunks_src, sizeof(uint8) * src->count);
+
+ /* Copy children or values */
+ if (IS_LEAF_NODE(src))
+ {
+ uint64 *values_src,
+ *values_dst;
+
+ Assert(IS_LEAF_NODE(dst));
+ values_src = rt_node_get_leaf_values(src);
+ values_dst = rt_node_get_leaf_values(dst);
+ memcpy(values_dst, values_src, sizeof(uint64) * src->count);
+ }
+ else
+ {
+ rt_node **children_src,
+ **children_dst;
+
+ Assert(!IS_LEAF_NODE(dst));
+ children_src = rt_node_get_inner_children(src);
+ children_dst = rt_node_get_inner_children(dst);
+ memcpy(children_dst, children_src, sizeof(rt_node *) * src->count);
+ }
+}
+
+/*
+ * Return the index of the (sorted) chunk array where the chunk is inserted.
+ * Set *found_p to true if the chunk already exists in the array.
+ */
+static int
+chunk_array_node_find_insert_pos(rt_node *node, uint8 chunk, bool *found_p)
+{
+ uint8 *chunks;
+ int idx;
+
+ ENSURE_CHUNK_ARRAY_NODE(node);
+
+ *found_p = false;
+ chunks = chunk_array_node_get_chunks(node);
+
+ /* Find the insert pos */
+ idx = search_chunk_array_le(chunks, chunk,
+ rt_node_info[node->kind].fanout,
+ node->count);
+
+ if (idx < node->count && chunks[idx] == chunk)
+ *found_p = true;
+
+ return idx;
+}
+
+/* Delete the chunk at idx */
+static void
+chunk_array_node_delete(rt_node *node, int idx)
+{
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ /* delete the chunk from the chunk array */
+ memmove(&(chunks[idx]), &(chunks[idx + 1]),
+ sizeof(uint8) * (node->count - idx - 1));
+
+ /* delete either the value or the child as well */
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_leaf_values(node);
+
+ memmove(&(values[idx]),
+ &(values[idx + 1]),
+ sizeof(uint64) * (node->count - idx - 1));
+ }
+ else
+ {
+ rt_node **children = rt_node_get_inner_children(node);
+
+ memmove(&(children[idx]),
+ &(children[idx + 1]),
+ sizeof(rt_node *) * (node->count - idx - 1));
+ }
+}
+
+/* Support functions for node-128 */
+
+/* Does the given chunk in the node have a value? */
+static pg_attribute_always_inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static pg_attribute_always_inline bool
+node_128_is_slot_used(rt_node_base_128 *node, uint8 slot)
+{
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+/* Get the pointer to either the child or the value corresponding to chunk */
+static void *
+node_128_get_slot(rt_node_base_128 *node, uint8 chunk)
+{
+ int slotpos;
+ void *slot;
+
+ slotpos = node->slot_idxs[chunk];
+ Assert(slotpos != RT_NODE_128_INVALID_IDX);
+
+ if (IS_LEAF_NODE(node))
+ slot = (void *) &(((rt_node_leaf_128 *) node)->values[slotpos]);
+ else
+ slot = (void *) (((rt_node_inner_128 *) node)->children[slotpos]);
+
+ return slot;
+}
+
+/* Delete the chunk in the node */
+static void
+node_128_delete(rt_node_base_128 *node, uint8 chunk)
+{
+ int slotpos = node->slot_idxs[chunk];
+
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Return an unused slot in node-128 */
+static int
+node_128_find_unused_slot(rt_node_base_128 *node, uint8 chunk)
+{
+ int slotpos;
+
+ /*
+ * Find an unused slot. We iterate over the isset bitmap per byte then
+ * check each bit.
+ */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+
+/* XXX: duplicate with node_128_set_leaf */
+static void
+node_128_set_inner(rt_node_base_128 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ /* Overwrite the existing value if exists */
+ if (node_128_is_chunk_used(node, chunk))
+ {
+ n128->children[n128->base.slot_idxs[chunk]] = child;
+ return;
+ }
+
+ /* find unused slot */
+ slotpos = node_128_find_unused_slot(node, chunk);
+
+ n128->base.slot_idxs[chunk] = slotpos;
+ n128->base.isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ n128->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static void
+node_128_set_leaf(rt_node_base_128 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+
+ /* Overwrite the existing value if exists */
+ if (node_128_is_chunk_used(node, chunk))
+ {
+ n128->values[n128->base.slot_idxs[chunk]] = value;
+ return;
+ }
+
+ /* find unused slot */
+ slotpos = node_128_find_unused_slot(node, chunk);
+
+ n128->base.slot_idxs[chunk] = slotpos;
+ n128->base.isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ n128->values[slotpos] = value;
+}
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static bool
+node_256_is_chunk_used(rt_node_base_256 *node, uint8 chunk)
+{
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+/* Get the pointer to either the child or the value corresponding to chunk */
+static void *
+node_256_get_slot(rt_node_base_256 *node, uint8 chunk)
+{
+ void *slot;
+
+ Assert(node_256_is_chunk_used(node, chunk));
+ if (IS_LEAF_NODE(node))
+ slot = (void *) &(((rt_node_leaf_256 *) node)->values[chunk]);
+ else
+ slot = (void *) (((rt_node_inner_256 *) node)->children[chunk]);
+
+ return slot;
+}
+
+/* Set the child in the node-256 */
+static pg_attribute_always_inline void
+node_256_set_inner(rt_node_base_256 *node, uint8 chunk, rt_node *child)
+{
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ n256->base.isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ n256->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static pg_attribute_always_inline void
+node_256_set_leaf(rt_node_base_256 *node, uint8 chunk, uint64 value)
+{
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ n256->base.isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ n256->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static pg_attribute_always_inline void
+node_256_delete(rt_node_base_256 *node, uint8 chunk)
+{
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static pg_attribute_always_inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_node_kind kind, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_info[kind].leaf_size);
+
+ newnode->kind = kind;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+
+ memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+ }
+
+ /* update the statistics */
+ tree->cnt[kind]++;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ tree->root = NULL;
+
+ /* update the statistics */
+ tree->cnt[node->kind]--;
+
+ Assert(tree->cnt[node->kind] >= 0);
+
+ pfree(node);
+}
+
+/* Copy the common fields without the node kind */
+static void
+rt_copy_node_common(rt_node *src, rt_node *dst)
+{
+ dst->shift = src->shift;
+ dst->chunk = src->chunk;
+ dst->count = src->count;
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node =
+ (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4, true);
+
+ node->base.n.count = 1;
+ node->base.n.shift = shift;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * Wrapper for rt_node_search to search the pointer to the child node in the
+ * node.
+ *
+ * Return true if the corresponding child is found, otherwise return false. On success,
+ * it sets child_p.
+ */
+static bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ rt_node *child;
+
+ if (!rt_node_search(node, key, action, (void **) &child))
+ return false;
+
+ if (child_p)
+ *child_p = child;
+
+ return true;
+}
+
+static bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint64 *value;
+
+ if (!rt_node_search(node, key, action, (void **) &value))
+ return false;
+
+ if (value_p)
+ *value_p = *value;
+
+ return true;
+}
+
+/*
+ * Return true if the corresponding slot is used, otherwise return false. On success,
+ * sets the pointer to the slot to slot_p.
+ */
+static bool
+rt_node_search(rt_node *node, uint64 key, rt_action action, void **slot_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ int idx;
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ idx = search_chunk_array_eq(chunks, chunk,
+ rt_node_info[node->kind].fanout,
+ node->count);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ *slot_p = chunk_array_node_get_slot(node, idx);
+ else /* RT_ACTION_DELETE */
+ chunk_array_node_delete(node, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+
+ /* If we find the chunk in the node, do the specified action */
+ if (node_128_is_chunk_used(n128, chunk))
+ {
+ if (action == RT_ACTION_FIND)
+ *slot_p = node_128_get_slot(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_128_delete(n128, chunk);
+
+ found = true;
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+
+ /* If we find the chunk in the node, do the specified action */
+ if (node_256_is_chunk_used(n256, chunk))
+ {
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ *slot_p = node_256_get_slot(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_256_delete(n256, chunk);
+ }
+
+ break;
+ }
+ }
+
+ /* Update the statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ return found;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ rt_node *node;
+
+ node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift > 0);
+ node->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = node;
+}
+
+/* Create a new child node and insert it into 'node' */
+static rt_node *
+rt_node_add_new_child(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key)
+{
+ uint8 newshift = node->shift - RT_NODE_SPAN;
+ rt_node *newchild =
+ (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, newshift > 0);
+
+ Assert(!IS_LEAF_NODE(node));
+
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+ rt_node_insert_inner(tree, parent, node, key, newchild, NULL);
+
+ return (rt_node *) newchild;
+}
+
+/*
+ * For upcoming insertions, we make sure that the node has enough free slots,
+ * growing the node if necessary. We set *will_replace_p to true if the chunk
+ * already exists and will be replaced on insertion.
+ *
+ * Return the index in the chunk array where the key can be inserted. We always
+ * return 0 in node-128 and node-256 cases.
+ */
+static int
+rt_node_prepare_insert(radix_tree *tree, rt_node *parent, rt_node **node_p,
+ uint64 key, bool *will_replace_p)
+{
+ rt_node *node = *node_p;
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool will_replace = false;
+ int idx = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ bool can_insert = false;
+
+ while ((node->kind == RT_NODE_KIND_4) ||
+ (node->kind == RT_NODE_KIND_16) ||
+ (node->kind == RT_NODE_KIND_32))
+ {
+ /* Find the insert pos */
+ idx = chunk_array_node_find_insert_pos(node, chunk, &will_replace);
+
+ if (will_replace || NODE_HAS_FREE_SLOT(node))
+ {
+ can_insert = true;
+ break;
+ }
+
+ node = rt_node_grow(tree, parent, node, key);
+ }
+
+ if (can_insert)
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ /*
+ * The node has unused slot for this chunk. If the key
+ * needs to be inserted in the middle of the array, we
+ * make space for the new key.
+ */
+ if (!will_replace && node->count != 0 && idx != node->count)
+ {
+ memmove(&(chunks[idx + 1]), &(chunks[idx]),
+ sizeof(uint8) * (node->count - idx));
+
+ /* shift either the values array or the children array */
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_leaf_values(node);
+
+ memmove(&(values[idx + 1]),
+ &(values[idx]),
+ sizeof(uint64) * (node->count - idx));
+ }
+ else
+ {
+ rt_node **children = rt_node_get_inner_children(node);
+
+ memmove(&(children[idx + 1]),
+ &(children[idx]),
+ sizeof(rt_node *) * (node->count - idx));
+ }
+ }
+
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+
+ if (node_128_is_chunk_used(n128, chunk) || NODE_HAS_FREE_SLOT(n128))
+ {
+ if (node_128_is_chunk_used(n128, chunk))
+ will_replace = true;
+
+ break;
+ }
+
+ node = rt_node_grow(tree, parent, node, key);
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+
+ if (node_256_is_chunk_used(n256, chunk))
+ will_replace = true;
+
+ break;
+ }
+ }
+
+ *node_p = node;
+ *will_replace_p = will_replace;
+
+ return idx;
+}
+
+/* Insert the child to the inner node */
+static void
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child, bool *replaced_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ int idx;
+ bool replaced;
+
+ Assert(!IS_LEAF_NODE(node));
+
+ idx = rt_node_prepare_insert(tree, parent, &node, key, &replaced);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+ rt_node **children = rt_node_get_inner_children(node);
+
+ Assert(idx >= 0);
+ chunks[idx] = chunk;
+ children[idx] = child;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ node_128_set_inner((rt_node_base_128 *) node, chunk, child);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ node_256_set_inner((rt_node_base_256 *) node, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!replaced)
+ node->count++;
+
+ if (replaced_p)
+ *replaced_p = replaced;
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+}
+
+/* Insert the value to the leaf node */
+static void
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value, bool *replaced_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ int idx;
+ bool replaced;
+
+ Assert(IS_LEAF_NODE(node));
+
+ idx = rt_node_prepare_insert(tree, parent, &node, key, &replaced);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+ uint64 *values = rt_node_get_leaf_values(node);
+
+ Assert(idx >= 0);
+ chunks[idx] = chunk;
+ values[idx] = value;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ node_128_set_leaf((rt_node_base_128 *) node, chunk, value);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ node_256_set_leaf((rt_node_base_256 *) node, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!replaced)
+ node->count++;
+
+ *replaced_p = replaced;
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+}
+
+/* Change the node type to the next larger one */
+static rt_node *
+rt_node_grow(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key)
+{
+ rt_node *newnode = NULL;
+
+ Assert(node->count == rt_node_info[node->kind].fanout);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_16,
+ IS_LEAF_NODE(node));
+
+ /* Copy both chunks and slots to the new node */
+ chunk_array_node_copy_chunks_and_slots(node, newnode);
+ break;
+ }
+ case RT_NODE_KIND_16:
+ {
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_32,
+ IS_LEAF_NODE(node));
+
+ /* Copy both chunks and slots to the new node */
+ chunk_array_node_copy_chunks_and_slots(node, newnode);
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_128,
+ IS_LEAF_NODE(node));
+
+ /* Copy both chunks and slots to the new node */
+ rt_copy_node_common(node, newnode);
+
+ if (IS_LEAF_NODE(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ for (int i = 0; i < node->count; i++)
+ node_128_set_leaf((rt_node_base_128 *) newnode,
+ n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ for (int i = 0; i < node->count; i++)
+ node_128_set_inner((rt_node_base_128 *) newnode,
+ n32->base.chunks[i], n32->children[i]);
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_256,
+ IS_LEAF_NODE(node));
+
+ /* Copy both chunks and slots to the new node */
+ rt_copy_node_common(node, newnode);
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->n.count; i++)
+ {
+ void *slot;
+
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ slot = node_128_get_slot(n128, i);
+
+ if (IS_LEAF_NODE(node))
+ node_256_set_leaf((rt_node_base_256 *) newnode, i,
+ *(uint64 *) slot);
+ else
+ node_256_set_inner((rt_node_base_256 *) newnode, i,
+ (rt_node *) slot);
+
+ cnt++;
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ elog(ERROR, "radix tree node-256 cannot grow");
+ break;
+ }
+
+ if (parent == node)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = newnode;
+ }
+ else
+ {
+ /* Set the new node to the parent node */
+ rt_node_insert_inner(tree, NULL, parent, key, newnode, NULL);
+ }
+
+ /* Verify the node has grown properly */
+ rt_verify_node(newnode);
+
+ /* Free the old node */
+ rt_free_node(tree, node);
+
+ return newnode;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ rt_node_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ rt_node_info[i].leaf_size);
+ tree->cnt[i] = 0;
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry exists, we update its value to 'value' and return
+ * true. Returns false if entry doesn't yet exist.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool replaced;
+ rt_node *node;
+ rt_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ child = rt_node_add_new_child(tree, parent, node, key);
+
+ Assert(child);
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* arrived at a leaf */
+ Assert(IS_LEAF_NODE(node));
+
+ rt_node_insert_leaf(tree, parent, node, key, value, &replaced);
+
+ /* Update the statistics */
+ if (!replaced)
+ tree->num_keys++;
+
+ return replaced;
+}
+
+/*
+ * Search the given key in the radix tree. Return true if the key is successfully
+ * found, otherwise return false. On success, we set the value to *value_p, so
+ * it must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* We reached a leaf node, search the corresponding slot */
+ Assert(IS_LEAF_NODE(node));
+
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+ return false;
+
+ return true;
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+	 * Descend the tree to search for the key while building a stack of the
+	 * nodes we visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = 0;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[level] = node;
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+
+ /*
+ * Delete the key from the leaf node and recursively delete internal nodes
+ * if necessary.
+ */
+ Assert(IS_LEAF_NODE(stack[level]));
+ while (level >= 0)
+ {
+ rt_node *node = stack[level--];
+
+ if (IS_LEAF_NODE(node))
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+ else
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!IS_EMPTY_NODE(node))
+ break;
+
+ Assert(deleted);
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+
+ }
+
+ /*
+ * If we eventually deleted the root node while recursively deleting empty
+ * nodes, we make the tree empty.
+ */
+ if (level == 0)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+ if (deleted)
+ tree->num_keys--;
+
+ return deleted;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+	/* empty tree */
+	if (!iter->tree->root)
+	{
+		MemoryContextSwitchTo(old_ctx);
+		return iter;
+	}
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+
+ iter->stack_len = top_level;
+ iter->stack[top_level].node = iter->tree->root;
+ iter->stack[top_level].current_idx = -1;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is being
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update the stack of the radix tree node while descending to the leaf from
+ * the 'from' level.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, int from)
+{
+ rt_node *node = iter->stack[from].node;
+ int level = from;
+
+ for (;;)
+ {
+ rt_iter_node_data *node_iter = &(iter->stack[level--]);
+ bool found;
+
+ /* Set the node to this level */
+ rt_store_iter_node(iter, node_iter, node);
+
+		/* Finish if we reached the leaf node */
+ if (IS_LEAF_NODE(node))
+ break;
+
+ /* Advance to the next slot in the node */
+ node = (rt_node *) rt_node_iterate_next(iter, node_iter, &found);
+
+ /*
+		 * Since we always get the first slot in the node, we must have
+		 * found the slot.
+ */
+ Assert(found);
+ }
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ bool found = false;
+ void *slot;
+
+	/* Empty tree */
+	if (!iter->tree->root)
+		return false;
+
+ for (;;)
+ {
+ rt_node *node;
+ rt_iter_node_data *node_iter;
+ int level;
+
+ /*
+		 * Iterate over the node at each level from the bottom of the tree,
+		 * i.e., the leaf node, until we find the next slot.
+ */
+ for (level = 0; level <= iter->stack_len; level++)
+ {
+ slot = rt_node_iterate_next(iter, &(iter->stack[level]), &found);
+
+ if (found)
+ break;
+ }
+
+ /* We could not find any new key-value pair, the iteration finished */
+ if (!found)
+ break;
+
+ /* found the next slot at the leaf node, return it */
+ if (level == 0)
+ {
+ *key_p = iter->key;
+ *value_p = *((uint64 *) slot);
+ break;
+ }
+
+ /*
+		 * We have advanced the slots in more than one node, including both
+		 * the leaf node and inner nodes. So we update the stack by
+		 * descending to the leftmost leaf node from this level.
+ */
+		node = (rt_node *) slot;
+ node_iter = &(iter->stack[level - 1]);
+ rt_store_iter_node(iter, node_iter, node);
+ rt_update_iter_stack(iter, level - 1);
+ }
+
+ return found;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Iterate over the given radix tree node and return the node's next slot,
+ * setting *found_p to true, if any. Otherwise, set *found_p to false.
+ */
+static void *
+rt_node_iterate_next(rt_iter *iter, rt_iter_node_data *node_iter, bool *found_p)
+{
+ rt_node *node = node_iter->node;
+ void *slot = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= node->count)
+ goto not_found;
+
+ slot = chunk_array_node_get_slot(node, node_iter->current_idx);
+
+ /* Update the part of the key by the current chunk */
+ if (IS_LEAF_NODE(node))
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ rt_iter_update_key(iter, chunks[node_iter->current_idx], 0);
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_128_is_chunk_used(n128, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = node_128_get_slot(n128, i);
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n128))
+ rt_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = node_256_get_slot(n256, i);
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n256))
+ rt_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ }
+
+ Assert(slot);
+ *found_p = true;
+ return slot;
+
+not_found:
+ *found_p = false;
+ return NULL;
+}
+
+/*
+ * Store the node in node_iter so we can begin iterating over the node.
+ * Also, update the part of the key using the chunk of the given node.
+ */
+static void
+rt_store_iter_node(rt_iter *iter, rt_iter_node_data *node_iter,
+ rt_node *node)
+{
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ rt_iter_update_key(iter, node->chunk, node->shift + RT_NODE_SPAN);
+}
+
+static pg_attribute_always_inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = 0;
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ /* Check if the chunks in the node are sorted */
+ for (int i = 1; i < node->count; i++)
+ Assert(chunks[i - 1] < chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(node_128_is_slot_used(n128, n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+			/* Check if the number of used chunks matches */
+ Assert(n256->n.count == cnt);
+
+ break;
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u, n16 = %u,n32 = %u, n128 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0],
+ tree->cnt[1],
+ tree->cnt[2],
+ tree->cnt[3],
+ tree->cnt[4]);
+ /* rt_dump(tree); */
+}
+
+static void
+rt_print_slot(StringInfo buf, uint8 chunk, uint64 value, int idx, bool is_leaf, int level)
+{
+ char space[128] = {0};
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ if (is_leaf)
+ appendStringInfo(buf, "%s[%d] \"0x%X\" val(0x%lX) LEAF\n",
+ space,
+ idx,
+ chunk,
+ value);
+ else
+ appendStringInfo(buf, "%s[%d] \"0x%X\" -> ",
+ space,
+ idx,
+ chunk);
+}
+
+static void
+rt_dump_node(rt_node *node, int level, StringInfo buf, bool recurse)
+{
+ bool is_leaf = IS_LEAF_NODE(node);
+
+ appendStringInfo(buf, "[\"%s\" type %d, cnt %u, shift %u, chunk \"0x%X\"] chunks:\n",
+ IS_LEAF_NODE(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ for (int i = 0; i < node->count; i++)
+ {
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_leaf_values(node);
+
+ rt_print_slot(buf, chunks[i],
+ values[i],
+ i, is_leaf, level);
+ }
+ else
+ rt_print_slot(buf, chunks[i],
+ UINT64_MAX,
+ i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ rt_node **children = rt_node_get_inner_children(node);
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node(children[i],
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ uint8 *tmp = (uint8 *) n128->isset;
+
+ appendStringInfo(buf, "slot_idxs:");
+ for (int j = 0; j < 256; j++)
+ {
+ if (!node_128_is_chunk_used(n128, j))
+ continue;
+
+ appendStringInfo(buf, " [%d]=%d, ", j, n128->slot_idxs[j]);
+ }
+ appendStringInfo(buf, "\nisset-bitmap:");
+ for (int j = 0; j < 16; j++)
+ {
+ appendStringInfo(buf, "%X ", (uint8) tmp[j]);
+ }
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < 256; i++)
+ {
+ void *slot;
+
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ slot = node_128_get_slot(n128, i);
+
+ if (is_leaf)
+ rt_print_slot(buf, i, *(uint64 *) slot,
+ i, is_leaf, level);
+ else
+ rt_print_slot(buf, i, UINT64_MAX, i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) slot,
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+
+ for (int i = 0; i < 256; i++)
+ {
+ void *slot;
+
+ if (!node_256_is_chunk_used(n256, i))
+ continue;
+
+ slot = node_256_get_slot(n256, i);
+
+ if (is_leaf)
+ rt_print_slot(buf, i, *(uint64 *) slot, i, is_leaf, level);
+ else
+ rt_print_slot(buf, i, UINT64_MAX, i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) slot, level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ StringInfoData buf;
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, &buf, false);
+
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+
+ elog(NOTICE, "\n%s", buf.data);
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ StringInfoData buf;
+
+ initStringInfo(&buf);
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu", tree->max_val);
+ rt_dump_node(tree->root, 0, &buf, true);
+ elog(NOTICE, "\n%s", buf.data);
+ elog(NOTICE, "-----------------------------------------------------------");
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..788eb13204
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+extern void rt_free(radix_tree *tree);
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
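+
+/*
+ * A minimal usage sketch (based on the functions declared above; error
+ * handling and iteration omitted):
+ *
+ *     radix_tree *tree = rt_create(CurrentMemoryContext);
+ *     uint64      val;
+ *
+ *     rt_set(tree, 42, 4200);         (returns false: key was not present)
+ *     if (rt_search(tree, 42, &val))  (sets val to 4200)
+ *         elog(NOTICE, "val = " UINT64_FORMAT, val);
+ *     rt_delete(tree, 42);            (returns true: key was present)
+ *     rt_free(tree);
+ */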
+
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9090226daa..51b2514faf 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -24,6 +24,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..671c3e0f47
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,507 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (if you do
+ * that, you might want to increase the number of values used in each
+ * test to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int rt_node_max_entries[] = {
+ 4, /* RT_NODE_KIND_4 */
+ 16, /* RT_NODE_KIND_16 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
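+
+/*
+ * For example, given how test_pattern() below expands a spec, the
+ * "clusters of ten" entry ("1111111111", spacing 10000) sets the keys
+ * 0..9, 10000..10009, 20000..20009, and so on: each '1' in pattern_str
+ * becomes an offset within a cluster, and the pattern repeats every
+ * 'spacing' keys until 'num_values' keys have been inserted.
+ */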
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ uint64 dummy;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ {
+ rt_dump(radixtree);
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(rt_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values are
+ * stored properly.
+ */
+ if (i == (rt_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : rt_node_max_entries[j - 1],
+ rt_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search
+ * entries again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the
+ * stats from the memory context. They should be in the same ballpark,
+ * but it's hard to automate testing that, so if you're making changes to
+ * the implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Jul 19, 2022 at 1:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I’d like to keep the first version simple. We can improve it and add
more optimizations later. Using radix tree for vacuum TID storage
would still be a big win comparing to using a flat array, even without
all these optimizations. In terms of single-value leaves method, I'm
also concerned about an extra pointer traversal and extra memory
allocation. It's most flexible but multi-value leaves method is also
flexible enough for many use cases. Using the single-value method
seems to be too much as the first step for me.
Overall, using 64-bit keys and 64-bit values would be a reasonable
choice for me as the first step. It can cover wider use cases
including vacuum TID use cases. And possibly it can cover use cases by
combining a hash table or using tree of tree, for example.
These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier to preserve optimization work already done, so +1 from me.
Thanks.
I've updated the patch. It now implements 64-bit keys, 64-bit values,
and the multi-value leaves method. I've tried to remove duplicated
codes but we might find a better way to do that.
With the recent changes related to simd, I'm going to split the patch
into at least two parts: introduce other simd optimized functions used
by the radix tree and the radix tree implementation. Particularly we
need two functions for radix tree: a function like pg_lfind32 but for
8 bits integers and return the index, and a function that returns the
index of the first element that is >= key.
Regards,
--
Masahiko Sawada
EDB: https://www.enterprisedb.com/
On Mon, Aug 15, 2022 at 12:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Jul 19, 2022 at 1:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I’d like to keep the first version simple. We can improve it and add
more optimizations later. Using radix tree for vacuum TID storage
would still be a big win comparing to using a flat array, even without
all these optimizations. In terms of single-value leaves method, I'm
also concerned about an extra pointer traversal and extra memory
allocation. It's most flexible but multi-value leaves method is also
flexible enough for many use cases. Using the single-value method
seems to be too much as the first step for me.
Overall, using 64-bit keys and 64-bit values would be a reasonable
choice for me as the first step. It can cover wider use cases
including vacuum TID use cases. And possibly it can cover use cases by
combining a hash table or using tree of tree, for example.
These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier to preserve optimization work already done, so +1 from me.
Thanks.
I've updated the patch. It now implements 64-bit keys, 64-bit values,
and the multi-value leaves method. I've tried to remove duplicated
codes but we might find a better way to do that.
With the recent changes related to simd, I'm going to split the patch
into at least two parts: introduce other simd optimized functions used
by the radix tree and the radix tree implementation. Particularly we
need two functions for radix tree: a function like pg_lfind32 but for
8 bits integers and return the index, and a function that returns the
index of the first element that is >= key.
I recommend looking at
/messages/by-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA@mail.gmail.com
since I did the work just now for searching bytes and returning a
bool, both = and <=. Should be pretty close. Also, I believe if you
left this for last as a possible refactoring, it might save some work.
In any case, I'll take a look at the latest patch next month.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Aug 15, 2022 at 10:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Aug 15, 2022 at 12:39 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Jul 22, 2022 at 10:43 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Jul 19, 2022 at 1:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jul 19, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I’d like to keep the first version simple. We can improve it and add
more optimizations later. Using radix tree for vacuum TID storage
would still be a big win comparing to using a flat array, even without
all these optimizations. In terms of single-value leaves method, I'm
also concerned about an extra pointer traversal and extra memory
allocation. It's most flexible but multi-value leaves method is also
flexible enough for many use cases. Using the single-value method
seems to be too much as the first step for me.
Overall, using 64-bit keys and 64-bit values would be a reasonable
choice for me as the first step. It can cover wider use cases
including vacuum TID use cases. And possibly it can cover use cases by
combining a hash table or using tree of tree, for example.
These two aspects would also bring it closer to Andres' prototype, which 1) makes review easier and 2) easier to preserve optimization work already done, so +1 from me.
Thanks.
I've updated the patch. It now implements 64-bit keys, 64-bit values,
and the multi-value leaves method. I've tried to remove duplicated
codes but we might find a better way to do that.
With the recent changes related to simd, I'm going to split the patch
into at least two parts: introduce other simd optimized functions used
by the radix tree and the radix tree implementation. Particularly we
need two functions for radix tree: a function like pg_lfind32 but for
8 bits integers and return the index, and a function that returns the
index of the first element that is >= key.
I recommend looking at
/messages/by-id/CAFBsxsESLUyJ5spfOSyPrOvKUEYYNqsBosue9SV1j8ecgNXSKA@mail.gmail.com
since I did the work just now for searching bytes and returning a
bool, both = and <=. Should be pretty close. Also, I believe if you
left this for last as a possible refactoring, it might save some work.
In any case, I'll take a look at the latest patch next month.
I've updated the radix tree patch. It's now separated into two patches.
0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
better names) that are similar to the pg_lfind8() family but they
return the index of the key in the vector instead of true/false. The
patch includes regression tests.
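For reference, the helpers in the attached 0001 patch have the following
signatures (see src/include/port/pg_lfind.h in that patch):

    static inline int pg_lsearch8(uint8 key, uint8 *base, uint32 nelem);
        /* index of 'key' in 'base', or -1 if not found */
    static inline int pg_lsearch8_ge(uint8 key, uint8 *base, uint32 nelem);
        /* index of the first element >= 'key', or nelem if none (assumes sorted input) */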
0002 patch is the main radix tree implementation. I've removed some
duplicated code for node manipulation. For instance, since node-4,
node-16, and node-32 have a similar structure with different fanouts,
I introduced common functions for them.
In addition to these two patches, I've attached a third patch. It's not
part of the radix tree implementation but introduces a contrib module,
bench_radix_tree, a tool for radix tree performance benchmarking. It
measures loading and lookup performance of both the radix tree and a
flat array.
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v6-0001-Support-pg_lsearch8_eq-and-pg_lsearch8_ge.patchapplication/x-patch; name=v6-0001-Support-pg_lsearch8_eq-and-pg_lsearch8_ge.patchDownload
From 5d0115b068ecb01d791eab5f8a78a6d25b9cf45c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:01 +0000
Subject: [PATCH v6 1/3] Support pg_lsearch8_eq and pg_lsearch8_ge
---
src/include/port/pg_lfind.h | 71 ++++++++
src/include/port/simd.h | 155 +++++++++++++++++-
.../test_lfind/expected/test_lfind.out | 12 ++
.../modules/test_lfind/sql/test_lfind.sql | 2 +
.../modules/test_lfind/test_lfind--1.0.sql | 8 +
src/test/modules/test_lfind/test_lfind.c | 139 ++++++++++++++++
6 files changed, 378 insertions(+), 9 deletions(-)
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 0625cac6b5..583f204763 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -80,6 +80,77 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
return false;
}
+/*
+ * pg_lsearch8
+ *
+ * Return the index of the element in 'base' that is equal to 'key'; otherwise return
+ * -1.
+ */
+static inline int
+pg_lsearch8(uint8 key, uint8 *base, uint32 nelem)
+{
+ uint32 i;
+
+ /* round down to multiple of vector length */
+ uint32 tail_idx = nelem & ~(sizeof(Vector8) - 1);
+ Vector8 chunk;
+
+ for (i = 0; i < tail_idx; i += sizeof(Vector8))
+ {
+ int idx;
+
+ vector8_load(&chunk, &base[i]);
+ if ((idx = vector8_search_eq(chunk, key)) != -1)
+ return i + idx;
+ }
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (key == base[i])
+ return i;
+ }
+
+ return -1;
+}
+
+
+/*
+ * pg_lsearch8_ge
+ *
+ * Return the index of the first element in 'base' that is greater than or equal to
+ * 'key'. Return nelem if there is no such element.
+ *
+ * Note that this function assumes the elements in 'base' are sorted.
+ */
+static inline int
+pg_lsearch8_ge(uint8 key, uint8 *base, uint32 nelem)
+{
+ uint32 i;
+
+ /* round down to multiple of vector length */
+ uint32 tail_idx = nelem & ~(sizeof(Vector8) - 1);
+ Vector8 chunk;
+
+ for (i = 0; i < tail_idx; i += sizeof(Vector8))
+ {
+ int idx;
+
+ vector8_load(&chunk, &base[i]);
+ if ((idx = vector8_search_ge(chunk, key)) != sizeof(Vector8))
+ return i + idx;
+ }
+
+ /* Process the remaining elements one at a time. */
+ for (; i < nelem; i++)
+ {
+ if (base[i] >= key)
+ break;
+ }
+
+ return i;
+}
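+
+/*
+ * Example of the expected behavior of the two helpers above: with a sorted
+ * array base = {1, 3, 5, 7} and nelem = 4, pg_lsearch8(5, base, 4) returns 2
+ * and pg_lsearch8(4, base, 4) returns -1, while pg_lsearch8_ge(4, base, 4)
+ * returns 2 (the index of 5) and pg_lsearch8_ge(8, base, 4) returns 4
+ * (== nelem).
+ */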
+
/*
* pg_lfind32
*
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..e2a99578a5 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -18,6 +18,8 @@
#ifndef SIMD_H
#define SIMD_H
+#include "port/pg_bitutils.h"
+
#if (defined(__x86_64__) || defined(_M_AMD64))
/*
* SSE2 instructions are part of the spec for the 64-bit x86 ISA. We assume
@@ -88,14 +90,9 @@ static inline Vector32 vector32_or(const Vector32 v1, const Vector32 v2);
static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
#endif
-/*
- * comparisons between vectors
- *
- * Note: These return a vector rather than boolean, which is why we don't
- * have non-SIMD implementations.
- */
-#ifndef USE_NO_SIMD
+/* comparisons between vectors */
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+#ifndef USE_NO_SIMD
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +274,140 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return a bitmask built from the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
+
+ return r;
+#endif
+}
+
+/*
+ * Return the index of the first element in the vector that is equal to the
+ * given scalar, or -1 if there is no such element.
+ */
+static inline int
+vector8_search_eq(const Vector8 v, const uint8 c)
+{
+ Vector8 keys = vector8_broadcast(c);
+ Vector8 cmp;
+ uint32 mask;
+ int result;
+
+#ifdef USE_ASSERT_CHECKING
+ int assert_result = -1;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ {
+ if (((const uint8 *) &v)[i] == c)
+ {
+ assert_result = i;
+ break;
+ }
+ }
+#endif /* USE_ASSERT_CHECKING */
+
+ cmp = vector8_eq(keys, v);
+ mask = vector8_highbit_mask(cmp);
+
+ if (mask)
+ result = pg_rightmost_one_pos32(mask);
+ else
+ result = -1;
+
+ Assert(assert_result == result);
+ return result;
+}
+
+/*
+ * Return the index of the first element in the vector that is greater than
+ * or equal to the given scalar. Return sizeof(Vector8) if there is no such
+ * element.
+ *
+ * Note that this function assumes the elements in the vector are sorted.
+ */
+static inline int
+vector8_search_ge(const Vector8 v, const uint8 c)
+{
+ Vector8 keys = vector8_broadcast(c);
+ Vector8 min;
+ Vector8 cmp;
+ uint32 mask;
+ int result;
+
+#ifdef USE_ASSERT_CHECKING
+ int assert_result = -1;
+ Size i;
+
+ for (i = 0; i < sizeof(Vector8); i++)
+ {
+ if (((const uint8 *) &v)[i] >= c)
+ break;
+ }
+ assert_result = i;
+#endif /* USE_ASSERT_CHECKING */
+
+ /*
+ * This is a bit more complicated than vector8_search_eq() because SSE2 has
+ * no unsigned 8-bit comparison instruction. Instead we use vector8_min():
+ * min(v[i], key) equals key exactly when v[i] >= key, so the first element
+ * for which the comparison below matches is the first element >= key.
+ */
+ min = vector8_min(v, keys);
+ cmp = vector8_eq(keys, min);
+ mask = vector8_highbit_mask(cmp);
+
+ if (mask)
+ result = pg_rightmost_one_pos32(mask);
+ else
+ result = sizeof(Vector8);
+
+ Assert(assert_result == result);
+ return result;
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -348,7 +479,6 @@ vector8_ssub(const Vector8 v1, const Vector8 v2)
 * Return a vector with all bits set in each lane where the corresponding
* lanes in the inputs are equal.
*/
-#ifndef USE_NO_SIMD
static inline Vector8
vector8_eq(const Vector8 v1, const Vector8 v2)
{
@@ -356,9 +486,16 @@ vector8_eq(const Vector8 v1, const Vector8 v2)
return _mm_cmpeq_epi8(v1, v2);
#elif defined(USE_NEON)
return vceqq_u8(v1, v2);
+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0;
+
+ return r;
#endif
}
-#endif /* ! USE_NO_SIMD */
#ifndef USE_NO_SIMD
static inline Vector32
diff --git a/src/test/modules/test_lfind/expected/test_lfind.out b/src/test/modules/test_lfind/expected/test_lfind.out
index 1d4b14e703..9416161955 100644
--- a/src/test/modules/test_lfind/expected/test_lfind.out
+++ b/src/test/modules/test_lfind/expected/test_lfind.out
@@ -22,3 +22,15 @@ SELECT test_lfind32();
(1 row)
+SELECT test_lsearch8();
+ test_lsearch8
+---------------
+
+(1 row)
+
+SELECT test_lsearch8_ge();
+ test_lsearch8_ge
+------------------
+
+(1 row)
+
diff --git a/src/test/modules/test_lfind/sql/test_lfind.sql b/src/test/modules/test_lfind/sql/test_lfind.sql
index 766c640831..d0dbb142ec 100644
--- a/src/test/modules/test_lfind/sql/test_lfind.sql
+++ b/src/test/modules/test_lfind/sql/test_lfind.sql
@@ -8,3 +8,5 @@ CREATE EXTENSION test_lfind;
SELECT test_lfind8();
SELECT test_lfind8_le();
SELECT test_lfind32();
+SELECT test_lsearch8();
+SELECT test_lsearch8_ge();
diff --git a/src/test/modules/test_lfind/test_lfind--1.0.sql b/src/test/modules/test_lfind/test_lfind--1.0.sql
index 81801926ae..13857cec3b 100644
--- a/src/test/modules/test_lfind/test_lfind--1.0.sql
+++ b/src/test/modules/test_lfind/test_lfind--1.0.sql
@@ -14,3 +14,11 @@ CREATE FUNCTION test_lfind8()
CREATE FUNCTION test_lfind8_le()
RETURNS pg_catalog.void
AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lsearch8()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION test_lsearch8_ge()
+ RETURNS pg_catalog.void
+ AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_lfind/test_lfind.c b/src/test/modules/test_lfind/test_lfind.c
index 82673d54c6..c494c27436 100644
--- a/src/test/modules/test_lfind/test_lfind.c
+++ b/src/test/modules/test_lfind/test_lfind.c
@@ -14,6 +14,7 @@
#include "postgres.h"
#include "fmgr.h"
+#include "lib/stringinfo.h"
#include "port/pg_lfind.h"
/*
@@ -115,6 +116,144 @@ test_lfind8_le(PG_FUNCTION_ARGS)
PG_RETURN_VOID();
}
+static void
+test_lsearch8_internal(uint8 key)
+{
+ uint8 charbuf[LEN_WITH_TAIL(Vector8)];
+ const int len_no_tail = LEN_NO_TAIL(Vector8);
+ const int len_with_tail = LEN_WITH_TAIL(Vector8);
+ int keypos;
+
+ memset(charbuf, 0xFF, len_with_tail);
+ /* search tail to test one-byte-at-a-time path */
+ keypos = len_with_tail - 1;
+ charbuf[keypos] = key;
+ if (key > 0x00 && (pg_lsearch8(key - 1, charbuf, len_with_tail) != -1))
+ elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key - 1);
+ if (key < 0xFF && (pg_lsearch8(key, charbuf, len_with_tail) != keypos))
+ elog(ERROR, "pg_lsearch8() did not find existing element '0x%x'", key);
+ if (key < 0xFE && (pg_lsearch8(key + 1, charbuf, len_with_tail) != -1))
+ elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key + 1);
+
+ memset(charbuf, 0xFF, len_with_tail);
+ /* search with vector operations */
+ keypos = len_no_tail - 1;
+ charbuf[keypos] = key;
+ if (key > 0x00 && (pg_lsearch8(key - 1, charbuf, len_no_tail) != -1))
+ elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key - 1);
+ if (key < 0xFF && (pg_lsearch8(key, charbuf, len_no_tail) != keypos))
+ elog(ERROR, "pg_lsearch8() did not find existing element '0x%x'", key);
+ if (key < 0xFE && (pg_lsearch8(key + 1, charbuf, len_no_tail) != -1))
+ elog(ERROR, "pg_lsearch8() found nonexistent element '0x%x'", key + 1);
+}
+
+PG_FUNCTION_INFO_V1(test_lsearch8);
+Datum
+test_lsearch8(PG_FUNCTION_ARGS)
+{
+ test_lsearch8_internal(0);
+ test_lsearch8_internal(1);
+ test_lsearch8_internal(0x7F);
+ test_lsearch8_internal(0x80);
+ test_lsearch8_internal(0x81);
+ test_lsearch8_internal(0xFD);
+ test_lsearch8_internal(0xFE);
+ test_lsearch8_internal(0xFF);
+
+ PG_RETURN_VOID();
+}
+
+static void
+report_lsearch8_error(uint8 *buf, int size, uint8 key, int result, int expected)
+{
+ StringInfoData bufstr;
+ char *sep = "";
+
+ initStringInfo(&bufstr);
+
+ for (int i = 0; i < size; i++)
+ {
+ appendStringInfo(&bufstr, "%s0x%02x", sep, buf[i]);
+ sep = ",";
+ }
+
+ elog(ERROR,
+ "pg_lsearch8_ge returned %d, expected %d, key 0x%02x buffer %s",
+ result, expected, key, bufstr.data);
+}
+
+/* workhorse for test_lsearch8_ge */
+static void
+test_lsearch8_ge_internal(uint8 *buf, uint8 key)
+{
+ const int len_no_tail = LEN_NO_TAIL(Vector8);
+ const int len_with_tail = LEN_WITH_TAIL(Vector8);
+ int expected;
+ int result;
+ int i;
+
+ /* search tail to test one-byte-at-a-time path */
+ for (i = 0; i < len_with_tail; i++)
+ {
+ if (buf[i] >= key)
+ break;
+ }
+ expected = i;
+ result = pg_lsearch8_ge(key, buf, len_with_tail);
+
+ if (result != expected)
+ report_lsearch8_error(buf, len_with_tail, key, result, expected);
+
+ /* search with vector operations */
+ for (i = 0; i < len_no_tail; i++)
+ {
+ if (buf[i] >= key)
+ break;
+ }
+ expected = i;
+ result = pg_lsearch8_ge(key, buf, len_no_tail);
+
+ if (result != expected)
+ report_lsearch8_error(buf, len_no_tail, key, result, expected);
+}
+
+static int
+cmp(const void *p1, const void *p2)
+{
+ uint8 v1 = *((const uint8 *) p1);
+ uint8 v2 = *((const uint8 *) p2);
+
+ if (v1 < v2)
+ return -1;
+ if (v1 > v2)
+ return 1;
+ return 0;
+}
+
+PG_FUNCTION_INFO_V1(test_lsearch8_ge);
+Datum
+test_lsearch8_ge(PG_FUNCTION_ARGS)
+{
+ uint8 charbuf[LEN_WITH_TAIL(Vector8)];
+ const int len_with_tail = LEN_WITH_TAIL(Vector8);
+
+ for (int i = 0; i < len_with_tail; i++)
+ charbuf[i] = (uint8) rand();
+
+ qsort(charbuf, len_with_tail, sizeof(uint8), cmp);
+
+ test_lsearch8_ge_internal(charbuf, 0);
+ test_lsearch8_ge_internal(charbuf, 1);
+ test_lsearch8_ge_internal(charbuf, 0x7F);
+ test_lsearch8_ge_internal(charbuf, 0x80);
+ test_lsearch8_ge_internal(charbuf, 0x81);
+ test_lsearch8_ge_internal(charbuf, 0xFD);
+ test_lsearch8_ge_internal(charbuf, 0xFE);
+ test_lsearch8_ge_internal(charbuf, 0xFF);
+
+ PG_RETURN_VOID();
+}
+
PG_FUNCTION_INFO_V1(test_lfind32);
Datum
test_lfind32(PG_FUNCTION_ARGS)
--
2.31.1
v6-0002-Add-radix-implementation.patchapplication/x-patch; name=v6-0002-Add-radix-implementation.patchDownload
From f49e91ec2a2dcb19259cbf1bc0fd73f36b29a201 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v6 2/3] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/radixtree.c | 2225 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 28 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 504 ++++
.../test_radixtree/test_radixtree.control | 4 +
12 files changed, 2854 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..b163eac480
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2225 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes: a small number of
+ * node types, each with a different number of elements. Depending on the
+ * number of children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instructions, enabling 256-bit wide
+ * SIMD vectors, whereas the paper uses 128-bit wide SIMD vectors. Also, there
+ * is no support for path compression or lazy path expansion. The radix tree
+ * supports only fixed-length keys, so we do not expect the tree to become
+ * very deep.
+ *
+ * Both the key and the value are 64-bit unsigned integers. Inner nodes
+ * (shift > 0) and leaf nodes (shift == 0) have slightly different structures:
+ * inner nodes store pointers to their child nodes as values, whereas leaf
+ * nodes store the 64-bit unsigned integers specified by the user as values.
+ * The paper refers to this technique as "Multi-value leaves". We chose it to
+ * avoid an additional pointer traversal; it is also the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iterate - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for each kind of radix tree node.
+ *
+ * rt_iterate_next() returns the key-value pairs in ascending order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes of is-set bitmap needed to cover nslots slots
+ * (e.g., RT_NODE_NSLOTS_BITS(128) is 16). Used by node kinds that track slot
+ * usage with a bitmap.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-128 */
+#define RT_NODE_128_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) \
+ ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
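+
+/*
+ * For example, with the 8-bit span the key 0x0102030405060708 decomposes
+ * into the chunks 0x01 (shift 56, the root level), 0x02 (shift 48), and so
+ * on down to 0x08 (shift 0, the leaf level); each chunk selects the slot to
+ * follow at that level.
+ */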
+
+/*
+ * Mapping from a slot number to the byte and bit in the is-set bitmap
+ * (used by node-128 and node-256).
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
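+
+/*
+ * For example, slot number 130 maps to byte isset[16] (130 / 8) and bit
+ * mask 0x04 (1 << (130 % 8)).
+ */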
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree nodes.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation
+ * smaller class should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 48 -> 152 -> 296 -> 1304 -> 2088 bytes for inner/leaf nodes, leading to
+ * large amounts of allocator padding with aset.c. Hence the use of slab.
+ *
+ * XXX: need to have node-1 until there is no path compression optimization?
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+typedef enum rt_node_kind
+{
+ RT_NODE_KIND_4 = 0,
+ RT_NODE_KIND_16,
+ RT_NODE_KIND_32,
+ RT_NODE_KIND_128,
+ RT_NODE_KIND_256
+} rt_node_kind;
+#define RT_NODE_KIND_COUNT (RT_NODE_KIND_256 + 1)
+
+/*
+ * Base type for all nodes types.
+ */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to represent the full
+ * fanout of 256 children, which does not fit in uint8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size class of the node */
+ rt_node_kind kind;
+} rt_node;
+
+/* Macros for radix tree nodes */
+#define IS_LEAF_NODE(n) (((rt_node *) (n))->shift == 0)
+#define IS_EMPTY_NODE(n) (((rt_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((rt_node *) (n))->count < rt_node_info[((rt_node *) (n))->kind].fanout)
+
+/*
+ * Definitions of the base types for inner and leaf nodes of each node type.
+ */
+
+/*
+ * node-4, node-16, and node-32 have a similar structure and differ only in
+ * fanout. Each has a chunk array and an equally sized array of values (or
+ * child pointers in inner nodes); entries are stored at corresponding
+ * positions and the chunks are kept sorted.
+*/
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base_16
+{
+ rt_node n;
+
+ /* 16 children, for key chunks */
+ uint8 chunks[16];
+} rt_node_base_16;
+
+typedef struct rt_node_base_32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-128 uses the slot_idxs array, an array of RT_NODE_MAX_SLOTS (256)
+ * entries, to store indexes into a second array that contains up to 128
+ * values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_base_128
+{
+ rt_node n;
+
+ /* Index into the slots array for each chunk; RT_NODE_128_INVALID_IDX if unused */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+} rt_node_base_128;
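+
+/*
+ * For example, if the chunk 0x07 is stored in slot 3, then
+ * slot_idxs[0x07] == 3 and bit 3 of the isset bitmap is set; a chunk that is
+ * not present has slot_idxs[chunk] == RT_NODE_128_INVALID_IDX.
+ */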
+
+/*
+ * node-256 is the largest node type. It has an array of RT_NODE_MAX_SLOTS
+ * entries for directly storing values (or child pointers in inner nodes),
+ * indexed by chunk.
+ */
+typedef struct rt_node_base_256
+{
+ rt_node n;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * Leaf node size classes are kept separate from inner node size classes for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* 4 children, for key chunks */
+ rt_node *children[4];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* 4 values, for key chunks */
+ uint64 values[4];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_16
+{
+ rt_node_base_16 base;
+
+ /* 16 children, for key chunks */
+ rt_node *children[16];
+} rt_node_inner_16;
+
+typedef struct rt_node_leaf_16
+{
+ rt_node_base_16 base;
+
+ /* 16 values, for key chunks */
+ uint64 values[16];
+} rt_node_leaf_16;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* 32 children, for key chunks */
+ rt_node *children[32];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* 32 values, for key chunks */
+ uint64 values[32];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 children */
+ rt_node *children[128];
+} rt_node_inner_128;
+
+typedef struct rt_node_leaf_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 values */
+ uint64 values[128];
+} rt_node_leaf_128;
+
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information of each size class */
+typedef struct rt_node_info_elem
+{
+ const char *name;
+ int fanout;
+ Size inner_size;
+ Size leaf_size;
+} rt_node_info_elem;
+
+static rt_node_info_elem rt_node_info[RT_NODE_KIND_COUNT] = {
+
+ [RT_NODE_KIND_4] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4),
+ .leaf_size = sizeof(rt_node_leaf_4),
+ },
+ [RT_NODE_KIND_16] = {
+ .name = "radix tree node 16",
+ .fanout = 16,
+ .inner_size = sizeof(rt_node_inner_16),
+ .leaf_size = sizeof(rt_node_leaf_16),
+ },
+ [RT_NODE_KIND_32] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32),
+ .leaf_size = sizeof(rt_node_leaf_32),
+ },
+ [RT_NODE_KIND_128] = {
+ .name = "radix tree node 128",
+ .fanout = 128,
+ .inner_size = sizeof(rt_node_inner_128),
+ .leaf_size = sizeof(rt_node_leaf_128),
+ },
+ [RT_NODE_KIND_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ },
+};
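+
+/*
+ * A sketch of how these size classes are intended to be used (based on the
+ * functions declared below): when an insertion finds no free slot in a node
+ * (NODE_HAS_FREE_SLOT is false), rt_node_grow() is expected to replace the
+ * node with one of the next larger kind, e.g. a full node-4 becomes a
+ * node-16. Nodes are never shrunk (see the XXX above).
+ */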
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
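+
+/*
+ * A sketch of the key construction during iteration (see
+ * rt_iter_update_key()): as the iterator descends, each level's chunk is
+ * placed into the key at that level's shift, so a leaf reached through the
+ * chunks 0x01, 0x02, ..., 0x08 yields the key 0x0102030405060708.
+ */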
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_NODE_KIND_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_node_kind kind, bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_copy_node_common(rt_node *src, rt_node *dst);
+static void rt_extend(radix_tree *tree, uint64 key);
+static bool rt_node_search(rt_node *node, uint64 key, rt_action action, void **slot_p);
+static bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static rt_node *rt_node_add_new_child(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key);
+static int rt_node_prepare_insert(radix_tree *tree, rt_node *parent,
+ rt_node **node_p, uint64 key,
+ bool *will_replace_p);
+static void rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child, bool *replaced_p);
+static void rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value, bool *replaced_p);
+static rt_node *rt_node_grow(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key);
+static void rt_update_iter_stack(rt_iter *iter, int from);
+static void *rt_node_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ bool *found_p);
+static void rt_update_node_iter(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node *node);
+static pg_attribute_always_inline void rt_iter_update_key(rt_iter *iter, uint8 chunk,
+ uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/* Return the array of children in the given inner node */
+static rt_node **
+rt_node_get_children(rt_node *node)
+{
+ rt_node **children = NULL;
+
+ Assert(!IS_LEAF_NODE(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ children = (rt_node **) ((rt_node_inner_4 *) node)->children;
+ break;
+ case RT_NODE_KIND_16:
+ children = (rt_node **) ((rt_node_inner_16 *) node)->children;
+ break;
+ case RT_NODE_KIND_32:
+ children = (rt_node **) ((rt_node_inner_32 *) node)->children;
+ break;
+ case RT_NODE_KIND_128:
+ children = (rt_node **) ((rt_node_inner_128 *) node)->children;
+ break;
+ case RT_NODE_KIND_256:
+ children = (rt_node **) ((rt_node_inner_256 *) node)->children;
+ break;
+ default:
+ elog(ERROR, "unexpected node type %u", node->kind);
+ }
+
+ return children;
+}
+
+/* Return the array of values in the given leaf node */
+static uint64 *
+rt_node_get_values(rt_node *node)
+{
+ uint64 *values = NULL;
+
+ Assert(IS_LEAF_NODE(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ values = ((rt_node_leaf_4 *) node)->values;
+ break;
+ case RT_NODE_KIND_16:
+ values = ((rt_node_leaf_16 *) node)->values;
+ break;
+ case RT_NODE_KIND_32:
+ values = ((rt_node_leaf_32 *) node)->values;
+ break;
+ case RT_NODE_KIND_128:
+ values = ((rt_node_leaf_128 *) node)->values;
+ break;
+ case RT_NODE_KIND_256:
+ values = ((rt_node_leaf_256 *) node)->values;
+ break;
+ default:
+ elog(ERROR, "unexpected node type %u", node->kind);
+ }
+
+ return values;
+}
+
+/*
+ * Node support functions for node-4, node-16, and node-32.
+ *
+ * These three node types have similar structure -- they have the array of chunks with
+ * different length and corresponding pointers or values depending on inner nodes or
+ * leaf nodes.
+ */
+#define CHECK_CHUNK_ARRAY_NODE(node) \
+ Assert(((((rt_node*) node)->kind) == RT_NODE_KIND_4) || \
+ ((((rt_node*) node)->kind) == RT_NODE_KIND_16) || \
+ ((((rt_node*) node)->kind) == RT_NODE_KIND_32))
+
+/* Get the slot at 'idx': a pointer to the value for leaf nodes, or the child pointer for inner nodes */
+static void *
+chunk_array_node_get_slot(rt_node *node, int idx)
+{
+ void *slot;
+
+ CHECK_CHUNK_ARRAY_NODE(node);
+
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_values(node);
+
+ slot = (void *) &(values[idx]);
+ }
+ else
+ {
+ rt_node **children = rt_node_get_children(node);
+
+ slot = (void *) children[idx];
+ }
+
+ return slot;
+}
+
+/* Return the chunk array in the node */
+static uint8 *
+chunk_array_node_get_chunks(rt_node *node)
+{
+ uint8 *chunk = NULL;
+
+ CHECK_CHUNK_ARRAY_NODE(node);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ chunk = (uint8 *) ((rt_node_base_4 *) node)->chunks;
+ break;
+ case RT_NODE_KIND_16:
+ chunk = (uint8 *) ((rt_node_base_16 *) node)->chunks;
+ break;
+ case RT_NODE_KIND_32:
+ chunk = (uint8 *) ((rt_node_base_32 *) node)->chunks;
+ break;
+ default:
+ /* this function doesn't support node-128 and node-256 */
+ elog(ERROR, "unsupported node type %d", node->kind);
+ }
+
+ return chunk;
+}
+
+/* Copy the contents of the node from 'src' to 'dst' */
+static void
+chunk_array_node_copy_contents(rt_node *src, rt_node *dst)
+{
+ uint8 *chunks_src,
+ *chunks_dst;
+
+ CHECK_CHUNK_ARRAY_NODE(src);
+ CHECK_CHUNK_ARRAY_NODE(dst);
+
+ /* Copy base type */
+ rt_copy_node_common(src, dst);
+
+ /* Copy chunk array */
+ chunks_src = chunk_array_node_get_chunks(src);
+ chunks_dst = chunk_array_node_get_chunks(dst);
+ memcpy(chunks_dst, chunks_src, sizeof(uint8) * src->count);
+
+ /* Copy children or values */
+ if (IS_LEAF_NODE(src))
+ {
+ uint64 *values_src,
+ *values_dst;
+
+ Assert(IS_LEAF_NODE(dst));
+ values_src = rt_node_get_values(src);
+ values_dst = rt_node_get_values(dst);
+ memcpy(values_dst, values_src, sizeof(uint64) * src->count);
+ }
+ else
+ {
+ rt_node **children_src,
+ **children_dst;
+
+ Assert(!IS_LEAF_NODE(dst));
+ children_src = rt_node_get_children(src);
+ children_dst = rt_node_get_children(dst);
+ memcpy(children_dst, children_src, sizeof(rt_node *) * src->count);
+ }
+}
+
+/*
+ * Return the index in the (sorted) chunk array at which the chunk should be
+ * inserted. Set *found_p to true if the chunk already exists in the array.
+ */
+static int
+chunk_array_node_find_insert_pos(rt_node *node, uint8 chunk, bool *found_p)
+{
+ uint8 *chunks;
+ int idx;
+
+ CHECK_CHUNK_ARRAY_NODE(node);
+
+ *found_p = false;
+ chunks = chunk_array_node_get_chunks(node);
+
+ /* Find the insert pos */
+ idx = pg_lsearch8_ge(chunk, chunks, node->count);
+
+ if (idx < node->count && chunks[idx] == chunk)
+ *found_p = true;
+
+ return idx;
+}
+
+/* Delete the chunk at idx */
+static void
+chunk_array_node_delete(rt_node *node, int idx)
+{
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ /* delete the chunk from the chunk array */
+ memmove(&(chunks[idx]), &(chunks[idx + 1]),
+ sizeof(uint8) * (node->count - idx - 1));
+
+ /* delete either the value or the child as well */
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_values(node);
+
+ memmove(&(values[idx]),
+ &(values[idx + 1]),
+ sizeof(uint64) * (node->count - idx - 1));
+ }
+ else
+ {
+ rt_node **children = rt_node_get_children(node);
+
+ memmove(&(children[idx]),
+ &(children[idx + 1]),
+ sizeof(rt_node *) * (node->count - idx - 1));
+ }
+}
+
+/* Support functions for node-128 */
+
+/* Does the given chunk in the node have a value? */
+static pg_attribute_always_inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static pg_attribute_always_inline bool
+node_128_is_slot_used(rt_node_base_128 *node, uint8 slot)
+{
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+/* Get the pointer to either the child or the value corresponding to chunk */
+static void *
+node_128_get_slot(rt_node_base_128 *node, uint8 chunk)
+{
+ int slotpos;
+ void *slot;
+
+ slotpos = node->slot_idxs[chunk];
+ Assert(slotpos != RT_NODE_128_INVALID_IDX);
+
+ if (IS_LEAF_NODE(node))
+ slot = (void *) &(((rt_node_leaf_128 *) node)->values[slotpos]);
+ else
+ slot = (void *) (((rt_node_inner_128 *) node)->children[slotpos]);
+
+ return slot;
+}
+
+/* Delete the chunk in the node */
+static void
+node_128_delete(rt_node_base_128 *node, uint8 chunk)
+{
+ int slotpos = node->slot_idxs[chunk];
+
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Return an unused slot in node-128 */
+static int
+node_128_find_unused_slot(rt_node_base_128 *node, uint8 chunk)
+{
+ int slotpos;
+
+ /*
+ * Find an unused slot. We iterate over the isset bitmap per byte then
+ * check each bit.
+ */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+
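+/* Set the child in the node-128 at the slot corresponding to the given chunk */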
+/* XXX: duplicate with node_128_set_leaf */
+static void
+node_128_set_inner(rt_node_base_128 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ /* Overwrite the existing child if the chunk already exists */
+ if (node_128_is_chunk_used(node, chunk))
+ {
+ n128->children[n128->base.slot_idxs[chunk]] = child;
+ return;
+ }
+
+ /* find unused slot */
+ slotpos = node_128_find_unused_slot(node, chunk);
+
+ n128->base.slot_idxs[chunk] = slotpos;
+ n128->base.isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ n128->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static void
+node_128_set_leaf(rt_node_base_128 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+
+ /* Overwrite the existing value if the chunk already exists */
+ if (node_128_is_chunk_used(node, chunk))
+ {
+ n128->values[n128->base.slot_idxs[chunk]] = value;
+ return;
+ }
+
+ /* find unused slot */
+ slotpos = node_128_find_unused_slot(node, chunk);
+
+ n128->base.slot_idxs[chunk] = slotpos;
+ n128->base.isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ n128->values[slotpos] = value;
+}
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static bool
+node_256_is_chunk_used(rt_node_base_256 *node, uint8 chunk)
+{
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+/* Get the pointer to either the child or the value corresponding to chunk */
+static void *
+node_256_get_slot(rt_node_base_256 *node, uint8 chunk)
+{
+ void *slot;
+
+ Assert(node_256_is_chunk_used(node, chunk));
+ if (IS_LEAF_NODE(node))
+ slot = (void *) &(((rt_node_leaf_256 *) node)->values[chunk]);
+ else
+ slot = (void *) (((rt_node_inner_256 *) node)->children[chunk]);
+
+ return slot;
+}
+
+/* Set the child in the node-256 */
+static pg_attribute_always_inline void
+node_256_set_inner(rt_node_base_256 *node, uint8 chunk, rt_node *child)
+{
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ n256->base.isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ n256->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static pg_attribute_always_inline void
+node_256_set_leaf(rt_node_base_256 *node, uint8 chunk, uint64 value)
+{
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ n256->base.isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ n256->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static pg_attribute_always_inline void
+node_256_delete(rt_node_base_256 *node, uint8 chunk)
+{
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift sufficient to store the given key.
+ */
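+/*
+ * For example, assuming an 8-bit chunk span (RT_NODE_SPAN = 8),
+ * key_get_shift(0xFF) returns 0 and key_get_shift(0x100) returns 8.
+ */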
+static pg_attribute_always_inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key value that a tree whose root node has the given
+ * shift can store.
+ */
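+/*
+ * For example, with an 8-bit span, shift 0 gives 0xFF and shift 8 gives
+ * 0xFFFF.
+ */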
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ rt_node *node;
+
+ node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift > 0);
+ node->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = node;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_node_kind kind, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_info[kind].leaf_size);
+
+ newnode->kind = kind;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+
+ memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+ }
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[kind]++;
+#endif
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ tree->root = NULL;
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[node->kind]--;
+
+ Assert(tree->cnt[node->kind] >= 0);
+#endif
+
+ pfree(node);
+}
+
+/* Copy the common fields except the node kind */
+static void
+rt_copy_node_common(rt_node *src, rt_node *dst)
+{
+ dst->shift = src->shift;
+ dst->chunk = src->chunk;
+ dst->count = src->count;
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it
+ * can store the key.
+ */
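+/*
+ * For example, if the current root has shift 8 (covering keys up to 0xFFFF)
+ * and the new key is 0x10000, one new node-4 root with shift 16 is added
+ * above the old root.
+ */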
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node =
+ (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4, true);
+
+ node->base.n.count = 1;
+ node->base.n.shift = shift;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * Search for the given key in the node. Return true if the key is found,
+ * otherwise return false. On success, we perform the specified action for
+ * the key and, for RT_ACTION_FIND, set *slot_p to the found slot.
+ */
+static bool
+rt_node_search(rt_node *node, uint64 key, rt_action action, void **slot_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ int idx;
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ idx = pg_lsearch8(chunk, chunks, node->count);
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ *slot_p = chunk_array_node_get_slot(node, idx);
+ else /* RT_ACTION_DELETE */
+ chunk_array_node_delete(node, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+
+ if (!node_128_is_chunk_used(n128, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ *slot_p = node_128_get_slot(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+
+ if (!node_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ *slot_p = node_256_get_slot(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* Update the statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ return found;
+}
+
+/*
+ * Search for the child pointer corresponding to the key in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to child_p.
+ */
+static bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ rt_node *child;
+
+ if (!rt_node_search(node, key, action, (void **) &child))
+ return false;
+
+ if (child_p)
+ *child_p = child;
+
+ return true;
+}
+
+/*
+ * Search for the value corresponding to the key in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * to the value is set to value_p.
+ */
+static bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint64 *value;
+
+ if (!rt_node_search(node, key, action, (void **) &value))
+ return false;
+
+ if (value_p)
+ *value_p = *value;
+
+ return true;
+}
+
+/* Create a new child node for 'key', insert it into 'node', and return it */
+static rt_node *
+rt_node_add_new_child(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key)
+{
+ uint8 newshift = node->shift - RT_NODE_SPAN;
+ rt_node *newchild =
+ (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, newshift > 0);
+
+ Assert(!IS_LEAF_NODE(node));
+
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+ rt_node_insert_inner(tree, parent, node, key, newchild, NULL);
+
+ return (rt_node *) newchild;
+}
+
+/*
+ * For an upcoming insertion, make sure that the node has a free slot, growing
+ * the node if necessary. *node_p is updated to point to the (possibly grown)
+ * node. *will_replace_p is set to true to tell the caller that the given
+ * chunk already exists in the node.
+ *
+ * Return the index in the chunk array where the key can be inserted. We
+ * always return 0 for node-128 and node-256.
+ */
+static int
+rt_node_prepare_insert(radix_tree *tree, rt_node *parent, rt_node **node_p,
+ uint64 key, bool *will_replace_p)
+{
+ rt_node *node = *node_p;
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool will_replace = false;
+ int idx = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ bool can_insert = false;
+
+ while ((node->kind == RT_NODE_KIND_4) ||
+ (node->kind == RT_NODE_KIND_16) ||
+ (node->kind == RT_NODE_KIND_32))
+ {
+ /* Find the insert pos */
+ idx = chunk_array_node_find_insert_pos(node, chunk, &will_replace);
+
+ if (will_replace || NODE_HAS_FREE_SLOT(node))
+ {
+ /*
+ * Found a place. We can insert a new entry or replace the
+ * existing value.
+ */
+ can_insert = true;
+ break;
+ }
+
+ node = rt_node_grow(tree, parent, node, key);
+ }
+
+ if (can_insert)
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ Assert(idx >= 0);
+
+ /*
+ * Make the space for the new key if it will be inserted in
+ * the middle of the array.
+ */
+ if (!will_replace && node->count != 0 && idx < node->count)
+ {
+ /* shift chunks array */
+ memmove(&(chunks[idx + 1]), &(chunks[idx]),
+ sizeof(uint8) * (node->count - idx));
+
+ /* shift either the values array or the children array */
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_values(node);
+
+ memmove(&(values[idx + 1]), &(values[idx]),
+ sizeof(uint64) * (node->count - idx));
+ }
+ else
+ {
+ rt_node **children = rt_node_get_children(node);
+
+ memmove(&(children[idx + 1]), &(children[idx]),
+ sizeof(rt_node *) * (node->count - idx));
+ }
+ }
+
+ break;
+ }
+
+ Assert(node->kind == RT_NODE_KIND_128);
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+
+ if (node_128_is_chunk_used(n128, chunk) || NODE_HAS_FREE_SLOT(n128))
+ {
+ if (node_128_is_chunk_used(n128, chunk))
+ will_replace = true;
+
+ break;
+ }
+
+ node = rt_node_grow(tree, parent, node, key);
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+
+ if (node_256_is_chunk_used(n256, chunk))
+ will_replace = true;
+
+ break;
+ }
+ }
+
+ *node_p = node;
+ *will_replace_p = will_replace;
+
+ return idx;
+}
+
+/* Insert the child to the inner node */
+static void
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child, bool *replaced_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ int idx;
+ bool replaced;
+
+ Assert(!IS_LEAF_NODE(node));
+
+ idx = rt_node_prepare_insert(tree, parent, &node, key, &replaced);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+ rt_node **children = rt_node_get_children(node);
+
+ Assert(idx >= 0);
+ chunks[idx] = chunk;
+ children[idx] = child;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ node_128_set_inner((rt_node_base_128 *) node, chunk, child);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ node_256_set_inner((rt_node_base_256 *) node, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!replaced)
+ node->count++;
+
+ if (replaced_p)
+ *replaced_p = replaced;
+
+ /*
+ * Done. Finally, verify that the chunk and child were inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+}
+
+/* Insert the value to the leaf node */
+static void
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value, bool *replaced_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ int idx;
+ bool replaced;
+
+ Assert(IS_LEAF_NODE(node));
+
+ idx = rt_node_prepare_insert(tree, parent, &node, key, &replaced);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+ uint64 *values = rt_node_get_values(node);
+
+ Assert(idx >= 0);
+ chunks[idx] = chunk;
+ values[idx] = value;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ node_128_set_leaf((rt_node_base_128 *) node, chunk, value);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ node_256_set_leaf((rt_node_base_256 *) node, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!replaced)
+ node->count++;
+
+ *replaced_p = replaced;
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+}
+
+/* Change the node type to the next larger one */
+static rt_node *
+rt_node_grow(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key)
+{
+ rt_node *newnode = NULL;
+
+ Assert(node->count == rt_node_info[node->kind].fanout);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_16,
+ IS_LEAF_NODE(node));
+
+ /* Copy both chunks and slots to the new node */
+ chunk_array_node_copy_contents(node, newnode);
+ break;
+ }
+ case RT_NODE_KIND_16:
+ {
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_32,
+ IS_LEAF_NODE(node));
+
+ /* Copy both chunks and slots to the new node */
+ chunk_array_node_copy_contents(node, newnode);
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_128,
+ IS_LEAF_NODE(node));
+
+ /* Copy the common node fields to the new node */
+ rt_copy_node_common(node, newnode);
+
+ if (IS_LEAF_NODE(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ for (int i = 0; i < node->count; i++)
+ node_128_set_leaf((rt_node_base_128 *) newnode,
+ n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ for (int i = 0; i < node->count; i++)
+ node_128_set_inner((rt_node_base_128 *) newnode,
+ n32->base.chunks[i], n32->children[i]);
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ newnode = rt_alloc_node(tree, RT_NODE_KIND_256,
+ IS_LEAF_NODE(node));
+
+ /* Copy the common node fields to the new node */
+ rt_copy_node_common(node, newnode);
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->n.count; i++)
+ {
+ void *slot;
+
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ slot = node_128_get_slot(n128, i);
+
+ if (IS_LEAF_NODE(node))
+ node_256_set_leaf((rt_node_base_256 *) newnode, i,
+ *(uint64 *) slot);
+ else
+ node_256_set_inner((rt_node_base_256 *) newnode, i,
+ (rt_node *) slot);
+
+ cnt++;
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ elog(ERROR, "radix tree node-256 cannot grow");
+ break;
+ }
+
+ if (parent == node)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = newnode;
+ }
+ else
+ {
+ /* Set the new node to the parent node */
+ rt_node_insert_inner(tree, NULL, parent, key, newnode, NULL);
+ }
+
+ /* Verify if the node has grown properly */
+ rt_verify_node(newnode);
+
+ /* Free the old node */
+ rt_free_node(tree, node);
+
+ return newnode;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ rt_node_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_info[i].name,
+ SLAB_DEFAULT_BLOCK_SIZE,
+ rt_node_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true; otherwise insert a new entry and return false.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool replaced;
+ rt_node *node;
+ rt_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ child = rt_node_add_new_child(tree, parent, node, key);
+
+ Assert(child);
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* arrived at a leaf */
+ Assert(IS_LEAF_NODE(node));
+
+ rt_node_insert_leaf(tree, parent, node, key, value, &replaced);
+
+ /* Update the statistics */
+ if (!replaced)
+ tree->num_keys++;
+
+ return replaced;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is
+ * found, otherwise return false. On success, we set the value to *value_p,
+ * so it must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* We reached a leaf node, so search the corresponding slot */
+ Assert(IS_LEAF_NODE(node));
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+ return false;
+
+ return true;
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int level;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes
+ * we visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = 0;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[level] = node;
+
+ if (IS_LEAF_NODE(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+
+ Assert(IS_LEAF_NODE(node));
+
+ /* there is no key to delete */
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
+ return false;
+
+ /* Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Delete the key from the leaf node and recursively delete the key in
+ * inner nodes if necessary.
+ */
+ Assert(IS_LEAF_NODE(stack[level]));
+ while (level >= 0)
+ {
+ rt_node *node = stack[level--];
+
+ if (IS_LEAF_NODE(node))
+ rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+ else
+ rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!IS_EMPTY_NODE(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ /*
+ * If we eventually deleted the root node while recursively deleting empty
+ * nodes (rt_free_node has already reset tree->root), the tree has become
+ * empty, so reset the max value as well.
+ */
+ if (tree->root == NULL)
+ tree->max_val = 0;
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* Return the iterator as-is if the tree is empty */
+ if (!iter->tree->root)
+ {
+ MemoryContextSwitchTo(old_ctx);
+ return iter;
+ }
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+
+ iter->stack_len = top_level;
+ iter->stack[top_level].node = iter->tree->root;
+ iter->stack[top_level].current_idx = -1;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update the iterator's stack of per-node iterators while descending from
+ * the 'from' level to the leftmost leaf.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, int from)
+{
+ rt_node *node = iter->stack[from].node;
+ int level = from;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+ bool found;
+
+ /* Set the node to this level */
+ rt_update_node_iter(iter, node_iter, node);
+
+ /* Finish if we reached the leaf node */
+ if (IS_LEAF_NODE(node))
+ break;
+
+ /* Advance to the next slot in the node */
+ node = (rt_node *) rt_node_iterate_next(iter, node_iter, &found);
+
+ /*
+ * Since we always get the first slot in the node, we must have found
+ * a slot.
+ */
+ Assert(found);
+ }
+}
+
+/*
+ * Return true with setting key_p and value_p if there is next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ bool found = false;
+ void *slot;
+
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *node;
+ rt_node_iter *node_iter;
+ int level;
+
+ /*
+ * Iterate over the node at each level, from the bottom of the tree
+ * (i.e., the leaf node) upward, until we find the next slot.
+ */
+ for (level = 0; level <= iter->stack_len; level++)
+ {
+ slot = rt_node_iterate_next(iter, &(iter->stack[level]), &found);
+
+ if (found)
+ break;
+ }
+
+ /* We could not find any new key-value pair, the iteration finished */
+ if (!found)
+ break;
+
+ /* found the next slot at the leaf node, return it */
+ if (level == 0)
+ {
+ *key_p = iter->key;
+ *value_p = *((uint64 *) slot);
+ break;
+ }
+
+ /*
+ * We have advanced the slot in an inner node as well as in the leaf
+ * node. So we update the stack by descending to the leftmost leaf
+ * node from this level.
+ */
+ node = (rt_node *) slot;
+ node_iter = &(iter->stack[level - 1]);
+ rt_update_node_iter(iter, node_iter, node);
+ rt_update_iter_stack(iter, level - 1);
+ }
+
+ return found;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Advance the iteration within the given radix tree node. If there is a next
+ * slot, return it and set *found_p to true; otherwise return NULL and set
+ * *found_p to false.
+ */
+static void *
+rt_node_iterate_next(rt_iter *iter, rt_node_iter *node_iter, bool *found_p)
+{
+ rt_node *node = node_iter->node;
+ void *slot = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ node_iter->current_idx++;
+
+ if (node_iter->current_idx >= node->count)
+ goto not_found;
+
+ slot = chunk_array_node_get_slot(node, node_iter->current_idx);
+
+ /* Update the part of the key by the current chunk */
+ if (IS_LEAF_NODE(node))
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ rt_iter_update_key(iter, chunks[node_iter->current_idx], 0);
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_128_is_chunk_used(n128, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = node_128_get_slot(n128, i);
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n128))
+ rt_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= 256)
+ goto not_found;
+
+ node_iter->current_idx = i;
+ slot = node_256_get_slot(n256, i);
+
+ /* Update the part of the key */
+ if (IS_LEAF_NODE(n256))
+ rt_iter_update_key(iter, node_iter->current_idx, 0);
+
+ break;
+ }
+ }
+
+ Assert(slot);
+ *found_p = true;
+ return slot;
+
+not_found:
+ *found_p = false;
+ return NULL;
+}
+
+/*
+ * Set the node in node_iter so we can begin iterating over the node.
+ * Also, update the part of the key corresponding to the given node's chunk.
+ */
+static void
+rt_update_node_iter(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node *node)
+{
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ rt_iter_update_key(iter, node->chunk, node->shift + RT_NODE_SPAN);
+}
+
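+/* Replace the chunk of the iterator's key at the given shift position */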
+static pg_attribute_always_inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = 0;
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ /* Check if the chunks in the node are sorted */
+ for (int i = 1; i < node->count; i++)
+ Assert(chunks[i - 1] < chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(node_128_is_slot_used(n128, n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check if the number of used chunks matches */
+ Assert(n256->n.count == cnt);
+
+ break;
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u, n16 = %u, n32 = %u, n128 = %u, n256 = %u\n",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0],
+ tree->cnt[1],
+ tree->cnt[2],
+ tree->cnt[3],
+ tree->cnt[4]);
+ /* rt_dump(tree); */
+}
+
+static void
+rt_print_slot(StringInfo buf, uint8 chunk, uint64 value, int idx, bool is_leaf, int level)
+{
+ char space[128] = {0};
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ if (is_leaf)
+ appendStringInfo(buf, "%s[%d] \"0x%X\" val(0x%lX) LEAF\n",
+ space,
+ idx,
+ chunk,
+ value);
+ else
+ appendStringInfo(buf, "%s[%d] \"0x%X\" -> ",
+ space,
+ idx,
+ chunk);
+}
+
+static void
+rt_dump_node(rt_node *node, int level, StringInfo buf, bool recurse)
+{
+ bool is_leaf = IS_LEAF_NODE(node);
+
+ appendStringInfo(buf, "[\"%s\" type %d, cnt %u, shift %u, chunk \"0x%X\"] chunks:\n",
+ IS_LEAF_NODE(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_16) ? 16 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ case RT_NODE_KIND_16:
+ case RT_NODE_KIND_32:
+ {
+ uint8 *chunks = chunk_array_node_get_chunks(node);
+
+ for (int i = 0; i < node->count; i++)
+ {
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 *values = rt_node_get_values(node);
+
+ rt_print_slot(buf, chunks[i],
+ values[i],
+ i, is_leaf, level);
+ }
+ else
+ rt_print_slot(buf, chunks[i],
+ UINT64_MAX,
+ i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ rt_node **children = rt_node_get_children(node);
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node(children[i],
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ uint8 *tmp = (uint8 *) n128->isset;
+
+ appendStringInfo(buf, "slot_idxs:");
+ for (int j = 0; j < 256; j++)
+ {
+ if (!node_128_is_chunk_used(n128, j))
+ continue;
+
+ appendStringInfo(buf, " [%d]=%d, ", j, n128->slot_idxs[j]);
+ }
+ appendStringInfo(buf, "\nisset-bitmap:");
+ for (int j = 0; j < 16; j++)
+ {
+ appendStringInfo(buf, "%X ", (uint8) tmp[j]);
+ }
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < 256; i++)
+ {
+ void *slot;
+
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ slot = node_128_get_slot(n128, i);
+
+ if (is_leaf)
+ rt_print_slot(buf, i, *(uint64 *) slot,
+ i, is_leaf, level);
+ else
+ rt_print_slot(buf, i, UINT64_MAX, i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) slot,
+ level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_base_256 *n256 = (rt_node_base_256 *) node;
+
+ for (int i = 0; i < 256; i++)
+ {
+ void *slot;
+
+ if (!node_256_is_chunk_used(n256, i))
+ continue;
+
+ slot = node_256_get_slot(n256, i);
+
+ if (is_leaf)
+ rt_print_slot(buf, i, *(uint64 *) slot, i, is_leaf, level);
+ else
+ rt_print_slot(buf, i, UINT64_MAX, i, is_leaf, level);
+
+ if (!is_leaf)
+ {
+ if (recurse)
+ {
+ StringInfoData buf2;
+
+ initStringInfo(&buf2);
+ rt_dump_node((rt_node *) slot, level + 1, &buf2, recurse);
+ appendStringInfo(buf, "%s", buf2.data);
+ }
+ else
+ appendStringInfo(buf, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ StringInfoData buf;
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, &buf, false);
+
+ if (IS_LEAF_NODE(node))
+ {
+ uint64 dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+
+ elog(NOTICE, "\n%s", buf.data);
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ StringInfoData buf;
+
+ initStringInfo(&buf);
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu", tree->max_val);
+ rt_dump_node(tree->root, 0, &buf, true);
+ elog(NOTICE, "\n%s", buf.data);
+ elog(NOTICE, "-----------------------------------------------------------");
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..38cc6abf4c
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+/* #define RT_DEBUG 1 */
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 6c31c8707c..8252ec41c4 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -25,6 +25,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..a4aa80a99c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,504 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int rt_node_max_entries[] = {
+ 4, /* RT_NODE_KIND_4 */
+ 16, /* RT_NODE_KIND_16 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ uint64 dummy;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree returned non-zero");
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(rt_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values are
+ * stored properly.
+ */
+ if (i == (rt_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : rt_node_max_entries[j - 1],
+ rt_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test inserting and deleting key-value pairs into each node type at the
+ * given shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search
+ * entries again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the
+ * stats from the memory context. They should be in the same ballpark,
+ * but it's hard to automate testing that, so if you're making changes to
+ * the implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", val, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT " after " UINT64_FORMAT " deletions",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
Attachment: v6-0003-tool-for-measuring-radix-tree-performance.patch (application/x-patch)
From 39f0019d95eb4808d235a07d107aee2ff46856e2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v6 3/3] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 ++
.../bench_radix_tree--1.0.sql | 42 +++
contrib/bench_radix_tree/bench_radix_tree.c | 301 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 399 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..b8f70e12d1
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..6663abe6a4
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,42 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..5806ef7519
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,301 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+
+static radix_tree *rt = NULL;
+static ItemPointer itemptrs = NULL;
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint32 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper-lower)+0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* for reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptrs[j];
+
+ itemptrs[j] = itemptrs[i];
+ itemptrs[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time, end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms, rt_search_ms, ar_load_ms, ar_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint64 key, val;
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ rt_search(rt, key, &val);
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+
+ bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(ar_load_ms);
+ values[5] = Int64GetDatum(rt_search_ms);
+ values[6] = Int64GetDatum(ar_search_ms);
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time, end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Aug 15, 2022 at 10:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
bool, both = and <=. Should be pretty close. Also, I believe if you
left this for last as a possible refactoring, it might save some work.
v6 demonstrates why this should have been put off towards the end. (more below)
In any case, I'll take a look at the latest patch next month.
Since the CF entry said "Needs Review", I began looking at v5 again
this week. Hopefully not too much has changed, but in the future I
strongly recommend setting to "Waiting on Author" if a new version is
forthcoming. I realize many here share updated patches at any time,
but I'd like to discourage the practice especially for large patches.
I've updated the radix tree patch. It's now separated into two patches.
0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
better names) that are similar to the pg_lfind8() family but they
return the index of the key in the vector instead of true/false. The
patch includes regression tests.
I don't want to do a full review of this just yet, but I'll just point
out some problems from a quick glance.
+/*
+ * Return the index of the first element in the vector that is greater than
+ * or equal to the given scalar. Return sizeof(Vector8) if there is no such
+ * element.
That's a bizarre API to indicate non-existence.
+ *
+ * Note that this function assumes the elements in the vector are sorted.
+ */
That is *completely* unacceptable for a general-purpose function.
+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0;
I don't think we should try to force the non-simd case to adopt the
special semantics of vector comparisons. It's much easier to just use
the same logic as the assert builds.
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t)
vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
For Arm, we need to be careful here. This article goes into a lot of
detail for this situation:
Here again, I'd rather put this off and focus on getting the "large
details" in good enough shape so we can got towards integrating with
vacuum.
In addition to two patches, I've attached the third patch. It's not
part of radix tree implementation but introduces a contrib module
bench_radix_tree, a tool for radix tree performance benchmarking. It
measures loading and lookup performance of both the radix tree and a
flat array.
Excellent! This was high on my wish list.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Sep 16, 2022 at 02:54:14PM +0700, John Naylor wrote:
Here again, I'd rather put this off and focus on getting the "large
details" in good enough shape so we can got towards integrating with
vacuum.
I started a new thread for the SIMD patch [0]/messages/by-id/20220917052903.GA3172400@nathanxps13 so that this thread can
remain focused on the radix tree stuff.
[0]: /messages/by-id/20220917052903.GA3172400@nathanxps13
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Fri, Sep 16, 2022 at 4:54 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Aug 15, 2022 at 10:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
bool, both = and <=. Should be pretty close. Also, I believe if you
left this for last as a possible refactoring, it might save some work.
v6 demonstrates why this should have been put off towards the end. (more below)
In any case, I'll take a look at the latest patch next month.
Since the CF entry said "Needs Review", I began looking at v5 again
this week. Hopefully not too much has changed, but in the future I
strongly recommend setting to "Waiting on Author" if a new version is
forthcoming. I realize many here share updated patches at any time,
but I'd like to discourage the practice especially for large patches.
Understood. Sorry for the inconveniences.
I've updated the radix tree patch. It's now separated into two patches.
0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
better names) that are similar to the pg_lfind8() family but they
return the index of the key in the vector instead of true/false. The
patch includes regression tests.
I don't want to do a full review of this just yet, but I'll just point
out some problems from a quick glance.
+/*
+ * Return the index of the first element in the vector that is greater than
+ * or equal to the given scalar. Return sizeof(Vector8) if there is no such
+ * element.
That's a bizarre API to indicate non-existence.
+ *
+ * Note that this function assumes the elements in the vector are sorted.
+ */
That is *completely* unacceptable for a general-purpose function.
+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0;
I don't think we should try to force the non-simd case to adopt the
special semantics of vector comparisons. It's much easier to just use
the same logic as the assert builds.
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
For Arm, we need to be careful here. This article goes into a lot of
detail for this situation:
Here again, I'd rather put this off and focus on getting the "large
details" in good enough shape so we can go towards integrating with
vacuum.
Thank you for the comments! These above comments are addressed by
Nathan in a newly derived thread. I'll work on the patch.
I'll consider how to integrate with vacuum as the next step. One
concern for me is how to limit the memory usage to
maintenance_work_mem. Unlike using a flat array, memory space for
adding one TID varies depending on the situation. If we want strictly
not to allow using memory more than maintenance_work_mem, probably we
need to estimate the memory consumption in a conservative way.
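For illustration only, a conservative check could be shaped roughly like this; rt_memory_usage() is from the patch, while dead_items_reached_limit() and RT_MAX_NODE_SIZE are invented names for this sketch:

/*
 * Sketch of a conservative limit check while accumulating dead TIDs.
 * Adding the size of the largest node means that inserting one more TID,
 * which may force a node to grow, cannot push us over the limit.
 */
static inline bool
dead_items_reached_limit(radix_tree *dead_items)
{
	uint64 limit = (uint64) maintenance_work_mem * 1024;	/* GUC is in kB */

	return rt_memory_usage(dead_items) + RT_MAX_NODE_SIZE >= limit;
}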
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Sep 20, 2022 at 3:19 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Sep 16, 2022 at 4:54 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Here again, I'd rather put this off and focus on getting the "large
details" in good enough shape so we can got towards integrating with
vacuum.Thank you for the comments! These above comments are addressed by
Nathan in a newly derived thread. I'll work on the patch.
I still seem to be out-voted on when to tackle this particular
optimization, so I've extended the v6 benchmark code with a hackish
function that populates a fixed number of keys, but with different fanouts.
(diff attached as a text file)
I didn't take particular care to make this scientific, but the following
seems pretty reproducible. Note what happens to load and search performance
when node16 has 15 entries versus 16:
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
15 | 327680 | 3776512 | 39 | 20
(1 row)
num_keys = 327680, height = 4, n4 = 1, n16 = 23408, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
16 | 327680 | 3514368 | 25 | 11
(1 row)
num_keys = 327680, height = 4, n4 = 0, n16 = 21846, n32 = 0, n128 = 0, n256 = 0
In trying to wrap the SIMD code behind layers of abstraction, the latest
patch (and Nathan's cleanup) threw it away in almost all cases. To explain,
we need to talk about how vectorized code deals with the "tail" that is too
small for the register:
1. Use a one-by-one algorithm, like we do for the pg_lfind* variants.
2. Read some junk into the register and mask off false positives from the
result.
There are advantages to both depending on the situation.
Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15
elements or less, it will iterate over them one-by-one exactly like a
node4. Only when full with 16 will the vector path be taken. When another
entry is added, the elements are copied to the next bigger node, so there's
a *small* window where it's fast.
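To make #2 concrete, here is a minimal standalone sketch of what a node16-style equality search could look like on an SSE2 build; the function name and shape are invented for illustration, and the entries past "count" are exactly the junk being masked off:

#include <emmintrin.h>
#include <stdint.h>

static inline int
node16_search_eq(const uint8_t chunks[16], int count, uint8_t key)
{
	__m128i  keyv = _mm_set1_epi8((char) key);
	/* load all 16 bytes; entries past "count" are junk but safely readable */
	__m128i  datav = _mm_loadu_si128((const __m128i *) chunks);
	uint32_t bits = (uint32_t) _mm_movemask_epi8(_mm_cmpeq_epi8(keyv, datav));

	/* mask off false positives coming from the uninitialized tail */
	bits &= (1u << count) - 1;

	/* __builtin_ctz (gcc/clang) used here only for brevity */
	return bits ? __builtin_ctz(bits) : -1;
}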
In short, this code needs to be lower level so that we still have full
control while being portable. I will work on this, and also the related
code for node dispatch.
Since v6 has some good infrastructure to do low-level benchmarking, I also
want to do some experiments with memory management.
(I have further comments about the code, but I will put that off until
later)
I'll consider how to integrate with vacuum as the next step. One
concern for me is how to limit the memory usage to
maintenance_work_mem. Unlike using a flat array, memory space for
adding one TID varies depending on the situation. If we want strictly
not to allow using memory more than maintenance_work_mem, probably we
need to estimate the memory consumption in a conservative way.
+1
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v6addendum-bench-node16.diff.txt (text/plain)
commit 18407962e96ccec6c9aeeba97412edd762a5a4fe
Author: John Naylor <john.naylor@postgresql.org>
Date: Wed Sep 21 11:44:43 2022 +0700
Add special benchmark function to test effect of fanout
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
index b8f70e12d1..952bb0ceae 100644
--- a/contrib/bench_radix_tree/Makefile
+++ b/contrib/bench_radix_tree/Makefile
@@ -7,7 +7,7 @@ OBJS = \
EXTENSION = bench_radix_tree
DATA = bench_radix_tree--1.0.sql
-REGRESS = bench
+REGRESS = bench_fixed_height
ifdef USE_PGXS
PG_CONFIG = pg_config
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 6663abe6a4..f2fee15b17 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -40,3 +40,15 @@ OUT load_ms int8)
returns record
as 'MODULE_PATHNAME'
LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 5806ef7519..0778da2d7b 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -13,6 +13,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "lib/radixtree.h"
+#include <math.h>
#include "miscadmin.h"
#include "utils/timestamp.h"
@@ -24,6 +25,7 @@ PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(bench_seq_search);
PG_FUNCTION_INFO_V1(bench_shuffle_search);
PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
static radix_tree *rt = NULL;
static ItemPointer itemptrs = NULL;
@@ -299,3 +301,108 @@ bench_load_random_int(PG_FUNCTION_ARGS)
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time, end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms, rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r, h, i, j, k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ /* lower nodes have limited fanout, the top is only limited by bits-per-byte */
+ for (r=1;;r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+ key = (r<<32) | (h<<24) | (i<<16) | (j<<8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r=1;;r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key, val;
+ key = (r<<32) | (h<<24) | (i<<16) | (j<<8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/expected/bench_fixed_height.out b/contrib/bench_radix_tree/expected/bench_fixed_height.out
new file mode 100644
index 0000000000..c4995afc13
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench_fixed_height.out
@@ -0,0 +1,6 @@
+create extension bench_radix_tree;
+\o fixed_height_search.data
+begin;
+select * from bench_fixed_height_search(15);
+select * from bench_fixed_height_search(16);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench_fixed_height.sql b/contrib/bench_radix_tree/sql/bench_fixed_height.sql
new file mode 100644
index 0000000000..0c06570e9a
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench_fixed_height.sql
@@ -0,0 +1,7 @@
+create extension bench_radix_tree;
+
+\o fixed_height_search.data
+begin;
+select * from bench_fixed_height_search(15);
+select * from bench_fixed_height_search(16);
+commit;
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index b163eac480..4ce8e9ad9d 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -1980,7 +1980,7 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
- fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u, n16 = %u,n32 = %u, n128 = %u, n256 = %u",
+ fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u, n16 = %u,n32 = %u, n128 = %u, n256 = %u\n",
tree->num_keys,
tree->root->shift / RT_NODE_SPAN,
tree->cnt[0],
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 38cc6abf4c..6016d593ee 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -15,7 +15,7 @@
#include "postgres.h"
-/* #define RT_DEBUG 1 */
+#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
In trying to wrap the SIMD code behind layers of abstraction, the latest
patch (and Nathan's cleanup) threw it away in almost all cases. To explain,
we need to talk about how vectorized code deals with the "tail" that is too
small for the register:
1. Use a one-by-one algorithm, like we do for the pg_lfind* variants.
2. Read some junk into the register and mask off false positives from the
result.
There are advantages to both depending on the situation.
Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15
elements or less, it will iterate over them one-by-one exactly like a
node4. Only when full with 16 will the vector path be taken. When another
entry is added, the elements are copied to the next bigger node, so there's
a *small* window where it's fast.
In short, this code needs to be lower level so that we still have full
control while being portable. I will work on this, and also the related
code for node dispatch.
Is it possible to use approach #2 here, too? AFAICT space is allocated for
all of the chunks, so there wouldn't be any danger in searching all of them
and discarding any results >= node->count. Granted, we're depending on the
number of chunks always being a multiple of elements-per-vector in order to
avoid the tail path, but that seems like a reasonably safe assumption that
can be covered with comments.
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Sep 22, 2022 at 1:01 AM Nathan Bossart <nathandbossart@gmail.com>
wrote:
On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
In short, this code needs to be lower level so that we still have full
control while being portable. I will work on this, and also the related
code for node dispatch.Is it possible to use approach #2 here, too? AFAICT space is allocated
for
all of the chunks, so there wouldn't be any danger in searching all them
and discarding any results >= node->count.
Sure, the caller could pass the maximum node capacity, and then check if
the returned index is within the range of the node count.
Granted, we're depending on the
number of chunks always being a multiple of elements-per-vector in order to
avoid the tail path, but that seems like a reasonably safe assumption that
can be covered with comments.
Actually, we don't need to depend on that at all. When I said "junk" above,
that can be any bytes, as long as we're not reading off the end of
allocated memory. We'll never do that here, since the child pointers/values
follow. In that case, the caller can hard-code the size (it would even
happen to work now to multiply rt_node_kind by 16, to be sneaky). One thing
I want to try soon is storing fewer than 16/32 etc entries, so that the
whole node fits comfortably inside a power-of-two allocation. That would
allow us to use aset without wasting space for the smaller nodes, which
would be faster and possibly would solve the fragmentation problem Andres
referred to in
/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
While on the subject, I wonder how important it is to keep the chunks in
the small nodes in sorted order. That adds branches and memmove calls, and
is the whole reason for the recent "pg_lfind_ge" function.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Sep 22, 2022 at 1:46 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Sep 22, 2022 at 1:01 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
In short, this code needs to be lower level so that we still have full
control while being portable. I will work on this, and also the related
code for node dispatch.
Is it possible to use approach #2 here, too? AFAICT space is allocated for
all of the chunks, so there wouldn't be any danger in searching all of them
and discarding any results >= node->count.
Sure, the caller could pass the maximum node capacity, and then check if the returned index is within the range of the node count.
Granted, we're depending on the
number of chunks always being a multiple of elements-per-vector in order to
avoid the tail path, but that seems like a reasonably safe assumption that
can be covered with comments.
Actually, we don't need to depend on that at all. When I said "junk" above, that can be any bytes, as long as we're not reading off the end of allocated memory. We'll never do that here, since the child pointers/values follow. In that case, the caller can hard-code the size (it would even happen to work now to multiply rt_node_kind by 16, to be sneaky). One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which would be faster and possibly would solve the fragmentation problem Andres referred to in
/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
While on the subject, I wonder how important it is to keep the chunks in the small nodes in sorted order. That adds branches and memmove calls, and is the whole reason for the recent "pg_lfind_ge" function.
Good point. While keeping the chunks in the small nodes in sorted
order is useful for visiting all keys in sorted order, additional
branches and memmove calls could be slow.
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Sep 22, 2022 at 1:26 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Thu, Sep 22, 2022 at 1:46 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
While on the subject, I wonder how important it is to keep the chunks
in the small nodes in sorted order. That adds branches and memmove calls,
and is the whole reason for the recent "pg_lfind_ge" function.
Good point. While keeping the chunks in the small nodes in sorted
order is useful for visiting all keys in sorted order, additional
branches and memmove calls could be slow.
Right, the ordering is a property that some users will need, so best to
keep it. Although the node128 doesn't have that property -- too slow to do
so, I think.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Sep 22, 2022 at 7:52 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
On Thu, Sep 22, 2022 at 1:26 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Good point. While keeping the chunks in the small nodes in sorted
order is useful for visiting all keys in sorted order, additional
branches and memmove calls could be slow.
Right, the ordering is a property that some users will need, so best to
keep it. Although the node128 doesn't have that property -- too slow to do
so, I think.
Nevermind, I must have been mixing up keys and values there...
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Sep 22, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com>
wrote:
One thing I want to try soon is storing fewer than 16/32 etc entries, so
that the whole node fits comfortably inside a power-of-two allocation. That
would allow us to use aset without wasting space for the smaller nodes,
which would be faster and possibly would solve the fragmentation problem
Andres referred to in
/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
While calculating node sizes that fit within a power-of-two size, I noticed
the current base node is a bit wasteful, taking up 8 bytes. The node kind
only has a small number of values, so it doesn't really make sense to use
an enum here in the struct (in fact, Andres' prototype used a uint8 for
node_kind). We could use a bitfield for the count and kind:
uint16 -- kind and count bitfield
uint8 shift;
uint8 chunk;
That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag,
the bitfield can just go back to being count only.
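As a sketch, the 4-byte header could look like this (field names are illustrative and assume PostgreSQL's uint8/uint16 typedefs):

/* Hypothetical 4-byte base node header using a bitfield for kind and count. */
typedef struct rt_node_base
{
	uint16		kind:2,		/* which of the node kinds this is */
				count:14;	/* number of children; 9 bits would suffice */
	uint8		shift;		/* bit shift for the key byte at this level */
	uint8		chunk;		/* key byte this node represents in its parent */
} rt_node_base;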
Here are the v6 node kinds:
node4: 8 + 4 +(4) + 4*8 = 48 bytes
node16: 8 + 16 + 16*8 = 152
node32: 8 + 32 + 32*8 = 296
node128: 8 + 256 + 128/8 + 128*8 = 1304
node256: 8 + 256/8 + 256*8 = 2088
And here are the possible ways we could optimize nodes for space using aset
allocation. Parentheses are padding bytes. Even if my math has mistakes,
the numbers shouldn't be too far off:
node3: 4 + 3 +(1) + 3*8 = 32 bytes
node6: 4 + 6 +(6) + 6*8 = 64
node13: 4 + 13 +(7) + 13*8 = 128
node28: 4 + 28 + 28*8 = 256
node31: 4 + 256 + 32/8 + 31*8 = 512 (XXX not good)
node94: 4 + 256 + 96/8 + 94*8 = 1024
node220: 4 + 256 + 224/8 + 220*8 = 2048
node256: = 4096
The main disadvantage is that node256 would balloon in size.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Sep 23, 2022 at 12:11 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Sep 22, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote:
One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which would be faster and possibly would solve the fragmentation problem Andres referred to in
/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
While calculating node sizes that fit within a power-of-two size, I noticed the current base node is a bit wasteful, taking up 8 bytes. The node kind only has a small number of values, so it doesn't really make sense to use an enum here in the struct (in fact, Andres' prototype used a uint8 for node_kind). We could use a bitfield for the count and kind:
uint16 -- kind and count bitfield
uint8 shift;
uint8 chunk;
That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, the bitfield can just go back to being count only.
Good point, agreed.
Here are the v6 node kinds:
node4: 8 + 4 +(4) + 4*8 = 48 bytes
node16: 8 + 16 + 16*8 = 152
node32: 8 + 32 + 32*8 = 296
node128: 8 + 256 + 128/8 + 128*8 = 1304
node256: 8 + 256/8 + 256*8 = 2088
And here are the possible ways we could optimize nodes for space using aset allocation. Parentheses are padding bytes. Even if my math has mistakes, the numbers shouldn't be too far off:
node3: 4 + 3 +(1) + 3*8 = 32 bytes
node6: 4 + 6 +(6) + 6*8 = 64
node13: 4 + 13 +(7) + 13*8 = 128
node28: 4 + 28 + 28*8 = 256
node31: 4 + 256 + 32/8 + 31*8 = 512 (XXX not good)
node94: 4 + 256 + 96/8 + 94*8 = 1024
node220: 4 + 256 + 224/8 + 220*8 = 2048
node256: = 4096
The main disadvantage is that node256 would balloon in size.
Yeah, node31 and node256 are bloated. We probably could use slab for
node256 independently. It's worth trying a benchmark to see how it
affects the performance and the tree size.
BTW We need to consider not only aset/slab but also DSA since we
allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
the following size classes:
static const uint16 dsa_size_classes[] = {
sizeof(dsa_area_span), 0, /* special size classes */
8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */
80, 96, 112, 128, /* 4 classes separated by 16 bytes */
160, 192, 224, 256, /* 4 classes separated by 32 bytes */
320, 384, 448, 512, /* 4 classes separated by 64 bytes */
640, 768, 896, 1024, /* 4 classes separated by 128 bytes */
1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */
};
node256 will be classed as 2616, which is still not good.
Anyway, I'll implement DSA support for radix tree.
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Sep 28, 2022 at 10:49 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
BTW We need to consider not only aset/slab but also DSA since we
allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
the following size classes:static const uint16 dsa_size_classes[] = {
[...]
Thanks for that info -- I wasn't familiar with the details of DSA. For the
non-parallel case, I plan to at least benchmark using aset because I gather
it's the most heavily optimized. I'm thinking that will allow other problem
areas to be more prominent. I'll also want to compare total context size
compared to slab to see if possibly less fragmentation makes up for other
wastage.
Along those lines, one thing I've been thinking about is the number of size
classes. There is a tradeoff between memory efficiency and number of
branches when searching/inserting. My current thinking is there is too much
coupling between size class and data type. Each size class currently uses a
different data type and a different algorithm to search and set it, which
in turn requires another branch. We've found that a larger number of size
classes leads to poor branch prediction [1]/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de and (I imagine) code density.
I'm thinking we can use "flexible array members" for the values/pointers,
and keep the rest of the control data in the struct the same. That way, we
never have more than 4 actual "kinds" to code and branch on. As a bonus,
when migrating a node to a larger size class of the same kind, we can
simply repalloc() to the next size. To show what I mean, consider this new
table:
node2: 5 + 6 +(5)+ 2*8 = 32 bytes
node6: 5 + 6 +(5)+ 6*8 = 64
node12: 5 + 27 + 12*8 = 128
node27: 5 + 27 + 27*8 = 248(->256)
node91: 5 + 256 + 28 +(7)+ 91*8 = 1024
node219: 5 + 256 + 28 +(7)+219*8 = 2048
node256: 5 + 32 +(3)+256*8 = 2088(->4096)
Seven size classes are grouped into the four kinds.
The common base at the front is here 5 bytes because there is a new uint8
field for "capacity", which we can ignore for node256 since we assume we
can always insert/update that node. The control data is the same in each
pair, and so the offset to the pointer/value array is the same. Thus,
migration would look something like:
case FOO_KIND:
if (unlikely(count == capacity))
{
if (capacity == XYZ) /* for smaller size class of the pair */
{
<repalloc to next size class>;
capacity = next-higher-capacity;
goto do_insert;
}
else
<migrate data to next node kind>;
}
else
{
do_insert:
<...>;
break;
}
/* FALLTHROUGH */
...
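For concreteness, a node covering the smallest pair of size classes might be laid out roughly like this; the struct and field names are invented for the sketch, and FLEXIBLE_ARRAY_MEMBER is the usual PostgreSQL macro:

/*
 * Sketch of one "kind" covering the node2/node6 pair of size classes;
 * growing from capacity 2 to 6 is just a repalloc() to the larger class.
 */
typedef struct rt_node_small
{
	uint16		kind:2,
				count:14;
	uint8		shift;
	uint8		chunk;
	uint8		capacity;		/* 2 or 6 depending on the size class */
	uint8		chunks[6];		/* search keys; only "capacity" slots are usable */
	uint64		slots[FLEXIBLE_ARRAY_MEMBER];	/* values or child pointers */
} rt_node_small;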
One disadvantage is that this wastes some space by reserving the full set
of control data in the smaller size class of the pair, but it's usually
small compared to array size. Somewhat unrelated, we could still implement
Andres' idea [1]/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de to dispense with the isset array in inner nodes of the
indirect array type (now node128), since we can just test if the pointer is
null.
[1]: /messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Sep 28, 2022 at 1:18 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
[stuff about size classes]
I kind of buried the lede here on one thing: If we only have 4 kinds
regardless of the number of size classes, we can use 2 bits of the pointer
for dispatch, which would only require 4-byte alignment. That should make
that technique more portable.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Sep 28, 2022 at 3:18 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Sep 28, 2022 at 10:49 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
BTW We need to consider not only aset/slab but also DSA since we
allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
the following size classes:
static const uint16 dsa_size_classes[] = {
[...]
Thanks for that info -- I wasn't familiar with the details of DSA. For the non-parallel case, I plan to at least benchmark using aset because I gather it's the most heavily optimized. I'm thinking that will allow other problem areas to be more prominent. I'll also want to compare total context size compared to slab to see if possibly less fragmentation makes up for other wastage.
Thanks!
Along those lines, one thing I've been thinking about is the number of size classes. There is a tradeoff between memory efficiency and number of branches when searching/inserting. My current thinking is there is too much coupling between size class and data type. Each size class currently uses a different data type and a different algorithm to search and set it, which in turn requires another branch. We've found that a larger number of size classes leads to poor branch prediction [1] and (I imagine) code density.
I'm thinking we can use "flexible array members" for the values/pointers, and keep the rest of the control data in the struct the same. That way, we never have more than 4 actual "kinds" to code and branch on. As a bonus, when migrating a node to a larger size class of the same kind, we can simply repalloc() to the next size.
Interesting idea. Using flexible array members for values would be
good also for the case in the future where we want to support other
value types than uint64.
With this idea, we can just repalloc() to grow to the larger size in a
pair but I'm slightly concerned that the more size class we use, the
more frequent the node needs to grow. If we want to support node
shrink, the deletion is also affected.
To show what I mean, consider this new table:
node2: 5 + 6 +(5)+ 2*8 = 32 bytes
node6: 5 + 6 +(5)+ 6*8 = 64
node12: 5 + 27 + 12*8 = 128
node27: 5 + 27 + 27*8 = 248(->256)
node91: 5 + 256 + 28 +(7)+ 91*8 = 1024
node219: 5 + 256 + 28 +(7)+219*8 = 2048
node256: 5 + 32 +(3)+256*8 = 2088(->4096)
Seven size classes are grouped into the four kinds.
The common base at the front is here 5 bytes because there is a new uint8 field for "capacity", which we can ignore for node256 since we assume we can always insert/update that node. The control data is the same in each pair, and so the offset to the pointer/value array is the same. Thus, migration would look something like:
I think we can use a bitfield for capacity. That way, we can pack
count (9bits), kind (2bits)and capacity (4bits) in uint16.
Somewhat unrelated, we could still implement Andres' idea [1] to dispense with the isset array in inner nodes of the indirect array type (now node128), since we can just test if the pointer is null.
Right. I didn't do that to use the common logic for inner node128 and
leaf node128.
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 2022-09-16 15:00:31 +0900, Masahiko Sawada wrote:
I've updated the radix tree patch. It's now separated into two patches.
cfbot notices a compiler warning:
https://cirrus-ci.com/task/6247907681632256?logs=gcc_warning#L446
[11:03:05.343] radixtree.c: In function ‘rt_iterate_next’:
[11:03:05.343] radixtree.c:1758:15: error: ‘slot’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[11:03:05.343] 1758 | *value_p = *((uint64 *) slot);
[11:03:05.343] | ^~~~~~~~~~~~~~~~~~
Greetings,
Andres Freund
On Mon, Oct 3, 2022 at 2:04 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-09-16 15:00:31 +0900, Masahiko Sawada wrote:
I've updated the radix tree patch. It's now separated into two patches.
cfbot notices a compiler warning:
https://cirrus-ci.com/task/6247907681632256?logs=gcc_warning#L446
[11:03:05.343] radixtree.c: In function ‘rt_iterate_next’:
[11:03:05.343] radixtree.c:1758:15: error: ‘slot’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[11:03:05.343] 1758 | *value_p = *((uint64 *) slot);
[11:03:05.343] | ^~~~~~~~~~~~~~~~~~
Thanks, I'll fix it in the next version patch.
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 23, 2022 at 12:11 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Sep 22, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote:
One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which would be faster and possibly would solve the fragmentation problem Andres referred to in
/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
While calculating node sizes that fit within a power-of-two size, I noticed the current base node is a bit wasteful, taking up 8 bytes. The node kind only has a small number of values, so it doesn't really make sense to use an enum here in the struct (in fact, Andres' prototype used a uint8 for node_kind). We could use a bitfield for the count and kind:
uint16 -- kind and count bitfield
uint8 shift;
uint8 chunk;
That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, the bitfield can just go back to being count only.
Good point, agreed.
Here are the v6 node kinds:
node4: 8 + 4 +(4) + 4*8 = 48 bytes
node16: 8 + 16 + 16*8 = 152
node32: 8 + 32 + 32*8 = 296
node128: 8 + 256 + 128/8 + 128*8 = 1304
node256: 8 + 256/8 + 256*8 = 2088
And here are the possible ways we could optimize nodes for space using aset allocation. Parentheses are padding bytes. Even if my math has mistakes, the numbers shouldn't be too far off:
node3: 4 + 3 +(1) + 3*8 = 32 bytes
node6: 4 + 6 +(6) + 6*8 = 64
node13: 4 + 13 +(7) + 13*8 = 128
node28: 4 + 28 + 28*8 = 256
node31: 4 + 256 + 32/8 + 31*8 = 512 (XXX not good)
node94: 4 + 256 + 96/8 + 94*8 = 1024
node220: 4 + 256 + 224/8 + 220*8 = 2048
node256: = 4096
The main disadvantage is that node256 would balloon in size.
Yeah, node31 and node256 are bloated. We probably could use slab for
node256 independently. It's worth trying a benchmark to see how it
affects the performance and the tree size.
BTW We need to consider not only aset/slab but also DSA since we
allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
the following size classes:
static const uint16 dsa_size_classes[] = {
sizeof(dsa_area_span), 0, /* special size classes */
8, 16, 24, 32, 40, 48, 56, 64, /* 8 classes separated by 8 bytes */
80, 96, 112, 128, /* 4 classes separated by 16 bytes */
160, 192, 224, 256, /* 4 classes separated by 32 bytes */
320, 384, 448, 512, /* 4 classes separated by 64 bytes */
640, 768, 896, 1024, /* 4 classes separated by 128 bytes */
1280, 1560, 1816, 2048, /* 4 classes separated by ~256 bytes */
2616, 3120, 3640, 4096, /* 4 classes separated by ~512 bytes */
5456, 6552, 7280, 8192 /* 4 classes separated by ~1024 bytes */
};
node256 will be classed as 2616, which is still not good.
Anyway, I'll implement DSA support for radix tree.
Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
to point to its child nodes, instead of C pointers (i.e., backend-local
address). I'm thinking of a straightforward approach as the first
step; inner nodes have a union of rt_node* and dsa_pointer and we
choose either one based on whether the radix tree is shared or not. We
allocate and free the shared memory for individual nodes by
dsa_allocate() and dsa_free(), respectively. Therefore we need to get
a C pointer from dsa_pointer by using dsa_get_address() while
descending the tree. I'm a bit concerned that calling
dsa_get_address() for every descent could be performance overhead but
I'm going to measure it anyway.
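As a rough sketch of that straightforward approach (the union, the tree->dsa field, and the helper name are all invented for illustration):

typedef union rt_ptr
{
	struct rt_node *local;		/* backend-local tree */
	dsa_pointer		shared;		/* tree allocated in a DSA area */
} rt_ptr;

static inline struct rt_node *
rt_ptr_get_node(radix_tree *tree, rt_ptr ptr)
{
	if (tree->dsa == NULL)
		return ptr.local;

	/* shared case: translate the dsa_pointer on every descent */
	return (struct rt_node *) dsa_get_address(tree->dsa, ptr.shared);
}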
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Oct 5, 2022 at 1:46 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Sep 23, 2022 at 12:11 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
Yeah, node31 and node256 are bloated. We probably could use slab for
node256 independently. It's worth trying a benchmark to see how it
affects the performance and the tree size.
This wasn't the focus of your current email, but while experimenting with
v6 I had another thought about local allocation: If we use the default slab
block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right?
If so, since aset and DSA also waste at least a few hundred bytes, we could
store a useless 256-byte slot array within node256. That way, node128 and
node256 share the same start of pointers/values array, so there would be
one less branch for getting that address. In v6, rt_node_get_values and
rt_node_get_children are not inlined (aside: gcc uses a jump table for 5
kinds but not for 4), but possibly should be, and the smaller the better.
Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
to point to its child nodes, instead of C pointers (i.e., backend-local
address). I'm thinking of a straightforward approach as the first
step; inner nodes have a union of rt_node* and dsa_pointer and we
choose either one based on whether the radix tree is shared or not. We
allocate and free the shared memory for individual nodes by
dsa_allocate() and dsa_free(), respectively. Therefore we need to get
a C pointer from dsa_pointer by using dsa_get_address() while
descending the tree. I'm a bit concerned that calling
dsa_get_address() for every descent could be performance overhead but
I'm going to measure it anyway.
Are dsa pointers aligned the same as pointers to locally allocated memory?
Meaning, is the offset portion always a multiple of 4 (or 8)? It seems that
way from a glance, but I can't say for sure. If the lower 2 bits of a DSA
pointer are never set, we can tag them the same way as a regular pointer.
That same technique could help hide the latency of converting the pointer,
by the same way it would hide the latency of loading parts of a node into
CPU registers.
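A minimal sketch of such 2-bit tagging, assuming at least 4-byte alignment for both representations (macro and function names are invented):

#define RT_PTR_KIND_MASK	UINT64CONST(0x3)

static inline uint64
rt_tag_ptr(uint64 ptr_or_dp, uint8 kind)
{
	/* works the same whether ptr_or_dp holds a local pointer or a dsa_pointer */
	return ptr_or_dp | (kind & RT_PTR_KIND_MASK);
}

static inline uint64
rt_untag_ptr(uint64 tagged, uint8 *kind)
{
	*kind = (uint8) (tagged & RT_PTR_KIND_MASK);
	return tagged & ~RT_PTR_KIND_MASK;
}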
One concern is, handling both local and dsa cases in the same code requires
more (predictable) branches and reduces code density. That might be a
reason in favor of templating to handle each case in its own translation
unit. But that might be overkill.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Oct 5, 2022 at 6:40 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Wed, Oct 5, 2022 at 1:46 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Sep 23, 2022 at 12:11 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
Yeah, node31 and node256 are bloated. We probably could use slab for
node256 independently. It's worth trying a benchmark to see how it
affects the performance and the tree size.
This wasn't the focus of your current email, but while experimenting with v6 I had another thought about local allocation: If we use the default slab block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right? If so, since aset and DSA also waste at least a few hundred bytes, we could store a useless 256-byte slot array within node256. That way, node128 and node256 share the same start of pointers/values array, so there would be one less branch for getting that address. In v6, rt_node_get_values and rt_node_get_children are not inlined (aside: gcc uses a jump table for 5 kinds but not for 4), but possibly should be, and the smaller the better.
It would be good for performance but I'm a bit concerned that it's
highly optimized to the design of aset and DSA. Since size 2088 will
be currently classed as 2616 in DSA, DSA wastes 528 bytes. However, if
we introduce a new class of 2304 (=2048 + 256) bytes we cannot store a
useless 256-byte and the assumption will be broken.
Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
to point to its child nodes, instead of C pointers (i.e., backend-local
address). I'm thinking of a straightforward approach as the first
step; inner nodes have a union of rt_node* and dsa_pointer and we
choose either one based on whether the radix tree is shared or not. We
allocate and free the shared memory for individual nodes by
dsa_allocate() and dsa_free(), respectively. Therefore we need to get
a C pointer from dsa_pointer by using dsa_get_address() while
descending the tree. I'm a bit concerned that calling
dsa_get_address() for every descent could be performance overhead but
I'm going to measure it anyway.
Are dsa pointers aligned the same as pointers to locally allocated memory? Meaning, is the offset portion always a multiple of 4 (or 8)?
I think so.
It seems that way from a glance, but I can't say for sure. If the lower 2 bits of a DSA pointer are never set, we can tag them the same way as a regular pointer. That same technique could help hide the latency of converting the pointer, by the same way it would hide the latency of loading parts of a node into CPU registers.
One concern is, handling both local and dsa cases in the same code requires more (predictable) branches and reduces code density. That might be a reason in favor of templating to handle each case in its own translation unit.
Right. We also need to support locking for shared radix tree, which
would require more branches.
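The straightforward approach I have in mind would look roughly like this (a sketch only; rt_child and the field names are not final):

#include "postgres.h"
#include "utils/dsa.h"

typedef struct rt_node rt_node;

typedef union rt_child
{
	rt_node    *local;			/* backend-local address */
	dsa_pointer shared;			/* offset into the DSA area */
} rt_child;

typedef struct radix_tree
{
	bool		is_shared;
	dsa_area   *area;			/* valid only if is_shared */
	/* ... */
} radix_tree;

/* Resolve a child slot to a usable pointer while descending. */
static inline rt_node *
rt_child_get_node(radix_tree *tree, rt_child child)
{
	if (tree->is_shared)
		return (rt_node *) dsa_get_address(tree->area, child.shared);
	else
		return child.local;
}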
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Oct 6, 2022 at 2:53 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Oct 5, 2022 at 6:40 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
This wasn't the focus of your current email, but while experimenting
with v6 I had another thought about local allocation: If we use the default
slab block size of 8192 bytes, then only 3 chunks of size 2088 can fit,
right? If so, since aset and DSA also waste at least a few hundred bytes,
we could store a useless 256-byte slot array within node256. That way,
node128 and node256 share the same start of pointers/values array, so there
would be one less branch for getting that address. In v6,
rt_node_get_values and rt_node_get_children are not inlined (asde: gcc uses
a jump table for 5 kinds but not for 4), but possibly should be, and the
smaller the better.
It would be good for performance but I'm a bit concerned that it's
highly optimized to the design of aset and DSA. Since size 2088 will
be currently classed as 2616 in DSA, DSA wastes 528 bytes. However, if
we introduce a new class of 2304 (=2048 + 256) bytes we cannot store a
useless 256-byte and the assumption will be broken.
A new DSA class is hypothetical. A better argument against my idea is that
SLAB_DEFAULT_BLOCK_SIZE is arbitrary. FWIW, I looked at the prototype just
now and the slab block sizes are:
Max(pg_nextpower2_32((MAXALIGN(inner_class_info[i].size) + 16) * 32), 1024)
...which would be 128kB for nodemax. I'm curious about the difference.
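For reference, plugging in the 2088-byte size from upthread (and assuming the +16 covers the memory chunk header): MAXALIGN(2088) = 2088, (2088 + 16) * 32 = 67328, and pg_nextpower2_32(67328) = 131072, i.e. 128kB, versus the 8192-byte SLAB_DEFAULT_BLOCK_SIZE.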
One concern is, handling both local and dsa cases in the same code
requires more (predictable) branches and reduces code density. That might
be a reason in favor of templating to handle each case in its own
translation unit.
Right. We also need to support locking for shared radix tree, which
would require more branches.
Hmm, now it seems we'll likely want to template local vs. shared as a later
step...
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
In addition to two patches, I've attached the third patch. It's not
part of radix tree implementation but introduces a contrib module
bench_radix_tree, a tool for radix tree performance benchmarking. It
measures loading and lookup performance of both the radix tree and a
flat array.
Hi Masahiko, I've been using these benchmarks, along with my own
variations, to try various things that I've mentioned. I'm long overdue for
an update, but the picture is not yet complete.
For now, I have two questions that I can't figure out on my own:
1. There seems to be some non-obvious limit on the number of keys that are
loaded (or at least what the numbers report). This is independent of the
number of tids per block. Example below:
john=# select * from bench_shuffle_search(0, 8*1000*1000);
NOTICE: num_keys = 8000000, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 =
250000, n256 = 981
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
8000000 | 268435456 | 48000000 | 661 |
29 | 276 | 389
john=# select * from bench_shuffle_search(0, 9*1000*1000);
NOTICE: num_keys = 8388608, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 =
262144, n256 = 1028
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
8388608 | 276824064 | 54000000 | 718 |
33 | 311 | 446
The array is the right size, but nkeys hasn't kept pace. Can you reproduce
this? Attached is the patch I'm using to show the stats when running the
test. (Side note: The numbers look unfavorable for radix tree because I'm
using 1 tid per block here.)
2. I found that bench_shuffle_search() is much *faster* for traditional
binary search on an array than bench_seq_search(). I've found this to be
true in every case. This seems counterintuitive to me -- any idea why this
is? Example:
john=# select * from bench_seq_search(0, 1000000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128
= 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 168 |
106 | 827 | 3348
john=# select * from bench_shuffle_search(0, 1000000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128
= 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 171 |
107 | 827 | 1400
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v65-0001-Turn-on-per-node-counts-in-benchmark.patch
From 43a50a385930ee340d0a3b003910c704a0ff342c Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 6 Oct 2022 09:07:41 +0700
Subject: [PATCH v65 1/5] Turn on per-node counts in benchmark
Also add gitignore, fix whitespace, and change to NOTICE
---
contrib/bench_radix_tree/.gitignore | 3 +++
contrib/bench_radix_tree/bench_radix_tree.c | 5 +++++
src/backend/lib/radixtree.c | 2 +-
src/include/lib/radixtree.h | 2 +-
4 files changed, 10 insertions(+), 2 deletions(-)
create mode 100644 contrib/bench_radix_tree/.gitignore
diff --git a/contrib/bench_radix_tree/.gitignore b/contrib/bench_radix_tree/.gitignore
new file mode 100644
index 0000000000..8830f5460d
--- /dev/null
+++ b/contrib/bench_radix_tree/.gitignore
@@ -0,0 +1,3 @@
+*data
+log/*
+results/*
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 5806ef7519..36c5218ae7 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -13,6 +13,7 @@
#include "fmgr.h"
#include "funcapi.h"
#include "lib/radixtree.h"
+#include <math.h>
#include "miscadmin.h"
#include "utils/timestamp.h"
@@ -183,6 +184,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
TimestampDifference(start_time, end_time, &secs, &usecs);
rt_load_ms = secs * 1000 + usecs / 1000;
+ rt_stats(rt);
+
/* measure the load time of the array */
itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
sizeof(ItemPointerData) * ntids);
@@ -292,6 +295,8 @@ bench_load_random_int(PG_FUNCTION_ARGS)
TimestampDifference(start_time, end_time, &secs, &usecs);
load_time_ms = secs * 1000 + usecs / 1000;
+ rt_stats(rt);
+
MemSet(nulls, false, sizeof(nulls));
values[0] = Int64GetDatum(rt_memory_usage(rt));
values[1] = Int64GetDatum(load_time_ms);
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index b163eac480..a84c06f0d4 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -1980,7 +1980,7 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
- fprintf(stderr, "num_keys = %lu, height = %u, n4 = %u, n16 = %u,n32 = %u, n128 = %u, n256 = %u",
+ elog(NOTICE, "num_keys = %lu, height = %u, n4 = %u, n16 = %u, n32 = %u, n128 = %u, n256 = %u",
tree->num_keys,
tree->root->shift / RT_NODE_SPAN,
tree->cnt[0],
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 38cc6abf4c..d5d7668617 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -15,7 +15,7 @@
#include "postgres.h"
-/* #define RT_DEBUG 1 */
+#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
--
2.37.3
On Fri, Oct 7, 2022 at 2:29 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
In addition to two patches, I've attached the third patch. It's not
part of radix tree implementation but introduces a contrib module
bench_radix_tree, a tool for radix tree performance benchmarking. It
measures loading and lookup performance of both the radix tree and a
flat array.
Hi Masahiko, I've been using these benchmarks, along with my own variations, to try various things that I've mentioned. I'm long overdue for an update, but the picture is not yet complete.
Thanks!
For now, I have two questions that I can't figure out on my own:
1. There seems to be some non-obvious limit on the number of keys that are loaded (or at least what the numbers report). This is independent of the number of tids per block. Example below:
john=# select * from bench_shuffle_search(0, 8*1000*1000);
NOTICE: num_keys = 8000000, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 250000, n256 = 981
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
8000000 | 268435456 | 48000000 | 661 | 29 | 276 | 389
john=# select * from bench_shuffle_search(0, 9*1000*1000);
NOTICE: num_keys = 8388608, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 262144, n256 = 1028
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
8388608 | 276824064 | 54000000 | 718 | 33 | 311 | 446
The array is the right size, but nkeys hasn't kept pace. Can you reproduce this? Attached is the patch I'm using to show the stats when running the test. (Side note: The numbers look unfavorable for radix tree because I'm using 1 tid per block here.)
Yes, I can reproduce this. In tid_to_key_off() we need to cast to
uint64 when packing offset number and block number:
tid_i = ItemPointerGetOffsetNumber(tid);
tid_i |= ItemPointerGetBlockNumber(tid) << shift;
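To spell out why: with the default 8kB block size, pg_ceil_log2_32(MaxHeapTuplesPerPage) is 9, and ItemPointerGetBlockNumber() returns a 32-bit value, so without the cast the shift is evaluated in 32-bit arithmetic and the top bits of the block number are lost before the OR. Block numbers therefore wrap at 2^23 = 8388608, which is where num_keys stalls in your example. The attached patch widens the operand first:

    tid_i = ItemPointerGetOffsetNumber(tid);
    tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;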
2. I found that bench_shuffle_search() is much *faster* for traditional binary search on an array than bench_seq_search(). I've found this to be true in every case. This seems counterintuitive to me -- any idea why this is? Example:
john=# select * from bench_seq_search(0, 1000000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 168 | 106 | 827 | 3348
john=# select * from bench_shuffle_search(0, 1000000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 171 | 107 | 827 | 1400
Ugh, in shuffle_itemptrs(), we shuffled the file-level itemptrs variable instead of the itemptr argument:
for (int i = 0; i < nitems - 1; i++)
{
int j = shuffle_randrange(&state, i, nitems - 1);
ItemPointerData t = itemptrs[j];
itemptrs[j] = itemptrs[i];
itemptrs[i] = t;
With the fix, the results on my environment were:
postgres(1:4093192)=# select * from bench_seq_search(0, 10000000);
2022-10-07 16:57:03.124 JST [4093192] LOG: num_keys = 10000000,
height = 3, n4 = 0, n16 = 1, n32 = 312500, n128 = 0, n256 = 1226
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
----------+------------------+---------------------+------------+---------------+--------------+-----------------
10000000 | 101826560 | 1800000000 | 846 |
486 | 6096 | 21128
(1 row)
Time: 28975.566 ms (00:28.976)
postgres(1:4093192)=# select * from bench_shuffle_search(0, 10000000);
2022-10-07 16:57:37.476 JST [4093192] LOG: num_keys = 10000000,
height = 3, n4 = 0, n16 = 1, n32 = 312500, n128 = 0, n256 = 1226
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
----------+------------------+---------------------+------------+---------------+--------------+-----------------
10000000 | 101826560 | 1800000000 | 845 |
484 | 32700 | 152583
(1 row)
I've attached a patch to fix them. Also, I realized that bsearch()
could be optimized out so I added code to prevent it:
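(The guard, as in the attached fix_bench_radix_tree.patch, keeps the result in a volatile variable so the call cannot be elided:)

    volatile bool ret;	/* prevent calling bsearch from being optimized out */

    CHECK_FOR_INTERRUPTS();

    ret = bsearch((void *) tid,
                  (void *) itemptrs,
                  ntids,
                  sizeof(ItemPointerData),
                  vac_cmp_itemptr);
    (void) ret;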
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
fix_bench_radix_tree.patch
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 0778da2d7b..d4c8040357 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -27,20 +27,17 @@ PG_FUNCTION_INFO_V1(bench_shuffle_search);
PG_FUNCTION_INFO_V1(bench_load_random_int);
PG_FUNCTION_INFO_V1(bench_fixed_height_search);
-static radix_tree *rt = NULL;
-static ItemPointer itemptrs = NULL;
-
static uint64
tid_to_key_off(ItemPointer tid, uint32 *off)
{
- uint32 upper;
+ uint64 upper;
uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
int64 tid_i;
Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
tid_i = ItemPointerGetOffsetNumber(tid);
- tid_i |= ItemPointerGetBlockNumber(tid) << shift;
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
/* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
*off = tid_i & ((1 << 6) - 1);
@@ -70,10 +67,10 @@ shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
for (int i = 0; i < nitems - 1; i++)
{
int j = shuffle_randrange(&state, i, nitems - 1);
- ItemPointerData t = itemptrs[j];
+ ItemPointerData t = itemptr[j];
- itemptrs[j] = itemptrs[i];
- itemptrs[i] = t;
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
}
}
@@ -138,6 +135,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
{
BlockNumber minblk = PG_GETARG_INT32(0);
BlockNumber maxblk = PG_GETARG_INT32(1);
+ ItemPointer itemptrs = NULL;
+ radix_tree *rt = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;;
@@ -185,6 +184,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
TimestampDifference(start_time, end_time, &secs, &usecs);
rt_load_ms = secs * 1000 + usecs / 1000;
+ rt_stats(rt);
+
/* measure the load time of the array */
itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
sizeof(ItemPointerData) * ntids);
@@ -210,12 +211,14 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
ItemPointer tid = &(tids[i]);
uint64 key, val;
uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being optimized out */
CHECK_FOR_INTERRUPTS();
key = tid_to_key_off(tid, &off);
- rt_search(rt, key, &val);
+ ret = rt_search(rt, key, &val);
+ (void) ret;
}
end_time = GetCurrentTimestamp();
TimestampDifference(start_time, end_time, &secs, &usecs);
@@ -226,12 +229,16 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
for (int i = 0; i < ntids; i++)
{
ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being optimized out */
- bsearch((void *) tid,
- (void *) itemptrs,
- ntids,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
}
end_time = GetCurrentTimestamp();
TimestampDifference(start_time, end_time, &secs, &usecs);
@@ -294,6 +301,8 @@ bench_load_random_int(PG_FUNCTION_ARGS)
TimestampDifference(start_time, end_time, &secs, &usecs);
load_time_ms = secs * 1000 + usecs / 1000;
+ rt_stats(rt);
+
MemSet(nulls, false, sizeof(nulls));
values[0] = Int64GetDatum(rt_memory_usage(rt));
values[1] = Int64GetDatum(load_time_ms);
The following is not quite a full review, but has plenty to think about.
There is too much to cover at once, and I have to start somewhere...
My main concerns are that internal APIs:
1. are difficult to follow
2. lead to poor branch prediction and too many function calls
Some of the measurements are picking on the SIMD search code, but I go into
details in order to demonstrate how a regression there can go completely
unnoticed. Hopefully the broader themes are informative.
On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
[fixed benchmarks]
Thanks for that! Now I can show clear results on some aspects in a simple
way. The attached patches (apply on top of v6) are not intended to be
incorporated as-is quite yet, but do point the way to some reorganization
that I think is necessary. I've done some testing on loading, but will
leave it out for now in the interest of length.
0001-0003 are your performance test fix and some small conveniences for
testing. Binary search is turned off, for example, because we already know
how it performs. And the sleep call is so I can run perf in a different shell
session, on only the search portion.
Note the v6 test loads all block numbers in the range. Since the test item
ids are all below 64 (reasonable), there are always 32 leaf chunks, so all
the leaves are node32 and completely full. This had the effect of never
taking the byte-wise loop in the proposed pg_lsearch function. These two
aspects make this an easy case for the branch predictor:
john=# select * from bench_seq_search(0, 1*1000*1000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128
= 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 167 |
0 | 822 | 0
1,470,141,841 branches:u
63,693 branch-misses:u # 0.00% of all
branches
john=# select * from bench_shuffle_search(0, 1*1000*1000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128
= 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 168 |
0 | 2174 | 0
1,470,142,569 branches:u
15,023,983 branch-misses:u # 1.02% of all branches
0004 randomizes block selection in the load part of the search test so that
each block has a 50% chance of being loaded. Note that now we have many
node16s where we had none before. Although node 16 and node32 appear to
share the same path in the switch statement of rt_node_search(), the chunk
comparison and node_get_values() calls each must go through different
branches. The shuffle case is most affected, but even the sequential case
slows down. (The leaves are less full -> there are more of them, so memory
use is larger, but it shouldn't matter much, in the sequential case at
least)
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889,
n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 173 |
0 | 907 | 0
1,684,114,926 branches:u
1,989,901 branch-misses:u # 0.12% of all branches
john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889,
n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 173 |
0 | 2890 | 0
1,684,115,844 branches:u
34,215,740 branch-misses:u # 2.03% of all branches
0005 replaces pg_lsearch with a branch-free SIMD search. Note that it
retains full portability and gains predictable performance. For
demonstration, it's used on all three linear-search types. Although I'm
sure it'd be way too slow for node4, this benchmark hardly has any so it's
ok.
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889,
n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 176 |
0 | 867 | 0
1,469,540,357 branches:u
96,678 branch-misses:u # 0.01% of all
branches
john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889,
n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 171 |
0 | 2530 | 0
1,469,540,533 branches:u
15,019,975 branch-misses:u # 1.02% of all branches
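To give a flavor of what 0005 does (this is a simplified scalar stand-in with invented names, not the actual patch code): compare every chunk unconditionally, collect the results in a bitmask, and derive the match index from the mask, so there is no early-exit branch for the predictor to miss.

#include "postgres.h"
#include "port/pg_bitutils.h"

/* Branch-free linear search over up to 32 chunk bytes; illustrative only. */
static inline int
node_lsearch_eq(const uint8 *chunks, int count, uint8 key)
{
	uint32		bitfield = 0;

	for (int i = 0; i < count; i++)
		bitfield |= ((uint32) (chunks[i] == key)) << i;

	return bitfield ? pg_rightmost_one_pos32(bitfield) : -1;
}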
0006 removes node16, and 0007 avoids a function call to introspect node
type. 0006 is really to make 0007 simpler to code. The crucial point here
is that calling out to rt_node_get_values/children() to figure out what
type we are is costly. With these patches, searching an unevenly populated
load is the same or faster than the original sequential load, despite
taking twice as much memory. (And, as I've noted before, decoupling size
class from node kind would win the memory back.)
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256
= 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 171 |
0 | 717 | 0
1,349,614,294 branches:u
1,313 branch-misses:u # 0.00% of all
branches
john=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256
= 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 172 |
0 | 2202 | 0
1,349,614,741 branches:u
30,592 branch-misses:u # 0.00% of all
branches
Expanding this point, once a path branches based on node kind, there should
be no reason to ever forget the kind. The abstractions in v6 have
disadvantages. I understand the reasoning -- to reduce duplication of code.
However, done this way, less code in the text editor leads to *more* code
(i.e. costly function calls and branches) on the machine level.
I haven't looked at insert/load performance carefully, but it's clear it
suffers from the same amnesia. prepare_node_for_insert() branches based on
the kind. If it must call rt_node_grow(), that function has no idea where
it came from and must branch again. When prepare_node_for_insert() returns
we again have no idea what the kind is, so must branch again. And if we are
one of the three linear-search nodes, we later do another function call,
where we encounter a 5-way jump table because the caller could be anything
at all.
Some of this could be worked around with always-inline functions to which
we pass a const node kind, and let the compiler get rid of the branches
etc. But many cases are probably not even worth doing that. For example, I
don't think prepare_node_for_insert() is a useful abstraction to begin
with. It returns an index, but only for linear nodes. Lookup nodes get a
return value of zero. There is not enough commonality here.
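As a sketch of what I mean by always-inline functions taking a const kind (toy definitions here, not the real nodes): each specialized caller passes a literal kind, and the compiler folds the branch away.

#include "postgres.h"

#define RT_NODE_KIND_4	0
#define RT_NODE_KIND_32	1

typedef struct rt_node
{
	uint8		kind;
	uint16		count;
} rt_node;

typedef struct rt_node_leaf_4
{
	rt_node		base;
	uint8		chunks[4];
	uint64		values[4];
} rt_node_leaf_4;

typedef struct rt_node_leaf_32
{
	rt_node		base;
	uint8		chunks[32];
	uint64		values[32];
} rt_node_leaf_32;

static pg_attribute_always_inline uint64 *
node_get_values(rt_node *node, const int kind)
{
	if (kind == RT_NODE_KIND_4)
		return ((rt_node_leaf_4 *) node)->values;
	else
		return ((rt_node_leaf_32 *) node)->values;
}

/*
 * A caller that already knows its kind writes e.g.
 *		uint64 *values = node_get_values(node, RT_NODE_KIND_32);
 * and the branch above disappears at compile time.
 */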
Along the same lines, there are a number of places that have branches as a
consequence of treating inner nodes and leaves with the same api:
rt_node_iterate_next
chunk_array_node_get_slot
node_128/256_get_slot
rt_node_search
I'm leaning towards splitting these out into specialized functions for each
inner and leaf. This is a bit painful for the last one, but perhaps if we
are resigned to templating the shared-mem case, maybe we can template some
of the inner/leaf stuff. Something to think about for later, but for now I
believe we have to accept some code duplication as a prerequisite for
decent performance as well as readability.
For the next steps, we need to proceed cautiously because there is a lot in
the air at the moment. Here are some aspects I would find desirable. If
there are impracticalities I haven't thought of, we can discuss further. I
don't pretend to know the practical consequences of every change I mention.
- If you have started coding the shared memory case, I'd advise to continue
so we can see what that looks like. If that has not gotten beyond the
design stage, I'd like to first see an attempt at tearing down some of the
clumsier abstractions in the current patch.
- As a "smoke test", there should ideally be nothing as general as
rt_node_get_children/values(). We should ideally always know what kind we
are if we found out earlier.
- For distinguishing between linear nodes, perhaps some always-inline
functions can help hide details. But at the same time, trying to treat them
the same is not always worthwhile.
- Start to separate treatment of inner/leaves and see how it goes.
- I firmly believe we only need 4 node *kinds*, and later we can decouple
the size classes as a separate concept. I'm willing to put serious time
into that once the broad details are right. I will also investigate pointer
tagging if we can confirm that can work similarly for dsa pointers.
Regarding size class decoupling, I'll respond to a point made earlier:
On Fri, Sep 30, 2022 at 10:47 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
With this idea, we can just repalloc() to grow to the larger size in a
pair but I'm slightly concerned that the more size class we use, the
more frequent the node needs to grow.
Well, yes, but that's orthogonal. For example, v6 has 5 node kinds. Imagine
that we have 4 node kinds, but the SIMD node kind used 2 size classes. Then
the nodes would grow at *exactly* the same frequency as they do today. I
listed many ways a size class could fit into a power-of-two (and there are
more), but we have a choice in how many to actually use. It's a trade off
between memory usage and complexity.
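A rough sketch of the decoupling (invented names and sizes): one linear-search kind backed by two allocation sizes, where growing within the kind is just a repalloc to the next class because the layout is identical.

#include "postgres.h"

/* Illustrative only: one node kind, two size classes */
typedef struct rt_size_class_elem
{
	int			fanout;			/* slots available in this class */
	Size		allocsize;		/* bytes to allocate for this class */
} rt_size_class_elem;

static const rt_size_class_elem rt_linear_classes[] = {
	{16, 320},					/* hypothetical sizes */
	{32, 576},
};

/* Same kind, same layout, just more room. */
static void *
rt_grow_to_class(void *node, int to_class)
{
	return repalloc(node, rt_linear_classes[to_class].allocsize);
}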
If we want to support node
shrink, the deletion is also affected.
Not necessarily. We don't have to shrink at the same granularity as
growing. My evidence is simple: we don't shrink at all now. :-)
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Oct 10, 2022 at 12:16 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
Thanks for that! Now I can show clear results on some aspects in a simple
way. The attached patches (apply on top of v6)
Forgot the patchset...
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
Hi,
On Mon, Oct 10, 2022 at 2:16 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
The following is not quite a full review, but has plenty to think about. There is too much to cover at once, and I have to start somewhere...
My main concerns are that internal APIs:
1. are difficult to follow
2. lead to poor branch prediction and too many function calls
Some of the measurements are picking on the SIMD search code, but I go into details in order to demonstrate how a regression there can go completely unnoticed. Hopefully the broader themes are informative.
On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
[fixed benchmarks]
Thanks for that! Now I can show clear results on some aspects in a simple way. The attached patches (apply on top of v6) are not intended to be incorporated as-is quite yet, but do point the way to some reorganization that I think is necessary. I've done some testing on loading, but will leave it out for now in the interest of length.
0001-0003 are your performance test fix and and some small conveniences for testing. Binary search is turned off, for example, because we know it already. And the sleep call is so I can run perf in a different shell session, on only the search portion.
Note the v6 test loads all block numbers in the range. Since the test item ids are all below 64 (reasonable), there are always 32 leaf chunks, so all the leaves are node32 and completely full. This had the effect of never taking the byte-wise loop in the proposed pg_lsearch function. These two aspects make this an easy case for the branch predictor:
john=# select * from bench_seq_search(0, 1*1000*1000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 167 | 0 | 822 | 01,470,141,841 branches:u
63,693 branch-misses:u # 0.00% of all branchesjohn=# select * from bench_shuffle_search(0, 1*1000*1000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 168 | 0 | 2174 | 01,470,142,569 branches:u
15,023,983 branch-misses:u # 1.02% of all branches0004 randomizes block selection in the load part of the search test so that each block has a 50% chance of being loaded. Note that now we have many node16s where we had none before. Although node 16 and node32 appear to share the same path in the switch statement of rt_node_search(), the chunk comparison and node_get_values() calls each must go through different branches. The shuffle case is most affected, but even the sequential case slows down. (The leaves are less full -> there are more of them, so memory use is larger, but it shouldn't matter much, in the sequential case at least)
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 173 | 0 | 907 | 01,684,114,926 branches:u
1,989,901 branch-misses:u # 0.12% of all branchesjohn=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 173 | 0 | 2890 | 01,684,115,844 branches:u
34,215,740 branch-misses:u # 2.03% of all branches0005 replaces pg_lsearch with a branch-free SIMD search. Note that it retains full portability and gains predictable performance. For demonstration, it's used on all three linear-search types. Although I'm sure it'd be way too slow for node4, this benchmark hardly has any so it's ok.
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 176 | 0 | 867 | 01,469,540,357 branches:u
96,678 branch-misses:u # 0.01% of all branchesjohn=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 171 | 0 | 2530 | 01,469,540,533 branches:u
15,019,975 branch-misses:u # 1.02% of all branches0006 removes node16, and 0007 avoids a function call to introspect node type. 0006 is really to make 0007 simpler to code. The crucial point here is that calling out to rt_node_get_values/children() to figure out what type we are is costly. With these patches, searching an unevenly populated load is the same or faster than the original sequential load, despite taking twice as much memory. (And, as I've noted before, decoupling size class from node kind would win the memory back.)
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 171 | 0 | 717 | 01,349,614,294 branches:u
1,313 branch-misses:u # 0.00% of all branchesjohn=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 172 | 0 | 2202 | 01,349,614,741 branches:u
30,592 branch-misses:u # 0.00% of all branchesExpanding this point, once a path branches based on node kind, there should be no reason to ever forget the kind. Ther abstractions in v6 have disadvantages. I understand the reasoning -- to reduce duplication of code. However, done this way, less code in the text editor leads to *more* code (i.e. costly function calls and branches) on the machine level.
Right. When updating the patch from v4 to v5, I eliminated the
duplication of code between each node type as much as possible, which
in turn produced more code on the machine level. The results of your
experiment clearly showed the downside of this work. FWIW I've also
confirmed your changes in my environment (I've added a third
argument to turn the randomized block selection proposed in the
0004 patch on and off):
* w/o patches
postgres(1:361692)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:33:15.460 JST [361692] LOG: num_keys = 1000000, height
= 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 87 |
| 462 |
(1 row)
1590104944 branches:u # 3.430 G/sec
65957 branch-misses:u # 0.00% of all branches
postgres(1:361692)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:33:28.934 JST [361692] LOG: num_keys = 999654, height =
2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 91 |
| 497 |
(1 row)
1748249456 branches:u # 3.506 G/sec
481074 branch-misses:u # 0.03% of all branches
postgres(1:361692)=# select * from bench_shuffle_search(0, 1 * 1000 *
1000, false);
2022-10-14 11:33:38.378 JST [361692] LOG: num_keys = 1000000, height
= 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 86 |
| 1290 |
(1 row)
1590105370 branches:u # 1.231 G/sec
15039443 branch-misses:u # 0.95% of all branches
Time: 4166.346 ms (00:04.166)
postgres(1:361692)=# select * from bench_shuffle_search(0, 2 * 1000 *
1000, true);
2022-10-14 11:33:51.556 JST [361692] LOG: num_keys = 999654, height =
2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 90 |
| 1536 |
(1 row)
1748250497 branches:u # 1.137 G/sec
28125016 branch-misses:u # 1.61% of all branches
* w/ all patches
postgres(1:360358)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:29:27.232 JST [360358] LOG: num_keys = 1000000, height
= 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 81 |
| 432 |
(1 row)
1380062209 branches:u # 3.185 G/sec
1066 branch-misses:u # 0.00% of all branches
postgres(1:360358)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:29:46.380 JST [360358] LOG: num_keys = 999654, height =
2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 88 |
| 438 |
(1 row)
1379640815 branches:u # 3.133 G/sec
1332 branch-misses:u # 0.00% of all branches
postgres(1:360358)=# select * from bench_shuffle_search(0, 1 * 1000 *
1000, false);
2022-10-14 11:30:00.943 JST [360358] LOG: num_keys = 1000000, height
= 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 81 |
| 994 |
(1 row)
1380062386 branches:u # 1.386 G/sec
18368 branch-misses:u # 0.00% of all branches
postgres(1:360358)=# select * from bench_shuffle_search(0, 2 * 1000 *
1000, true);
2022-10-14 11:30:15.944 JST [360358] LOG: num_keys = 999654, height =
2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 88 |
| 1098 |
(1 row)
1379641503 branches:u # 1.254 G/sec
18973 branch-misses:u # 0.00% of all branches
I haven't looked at insert/load performance carefully, but it's clear it suffers from the same amnesia. prepare_node_for_insert() branches based on the kind. If it must call rt_node_grow(), that function has no idea where it came from and must branch again. When prepare_node_for_insert() returns we again have no idea what the kind is, so must branch again. And if we are one of the three linear-search nodes, we later do another function call, where we encounter a 5-way jump table because the caller could be anything at all.
Some of this could be worked around with always-inline functions to which we pass a const node kind, and let the compiler get rid of the branches etc. But many cases are probably not even worth doing that. For example, I don't think prepare_node_for_insert() is a useful abstraction to begin with. It returns an index, but only for linear nodes. Lookup nodes get a return value of zero. There is not enough commonality here.
Agreed.
Along the same lines, there are a number of places that have branches as a consequence of treating inner nodes and leaves with the same api:
rt_node_iterate_next
chunk_array_node_get_slot
node_128/256_get_slot
rt_node_search
I'm leaning towards splitting these out into specialized functions for each inner and leaf. This is a bit painful for the last one, but perhaps if we are resigned to templating the shared-mem case, maybe we can template some of the inner/leaf stuff. Something to think about for later, but for now I believe we have to accept some code duplication as a prerequisite for decent performance as well as readability.
Agreed.
For the next steps, we need to proceed cautiously because there is a lot in the air at the moment. Here are some aspects I would find desirable. If there are impracticalities I haven't thought of, we can discuss further. I don't pretend to know the practical consequences of every change I mention.
- If you have started coding the shared memory case, I'd advise to continue so we can see what that looks like. If that has not gotten beyond the design stage, I'd like to first see an attempt at tearing down some of the clumsier abstractions in the current patch.
- As a "smoke test", there should ideally be nothing as general as rt_node_get_children/values(). We should ideally always know what kind we are if we found out earlier.
- For distinguishing between linear nodes, perhaps some always-inline functions can help hide details. But at the same time, trying to treat them the same is not always worthwhile.
- Start to separate treatment of inner/leaves and see how it goes.
Since I've not started coding the shared memory case seriously, I'm
going to start with eliminating abstractions and splitting the
treatment of inner and leaf nodes.
- I firmly believe we only need 4 node *kinds*, and later we can decouple the size classes as a separate concept. I'm willing to put serious time into that once the broad details are right. I will also investigate pointer tagging if we can confirm that can work similarly for dsa pointers.
I'll keep 4 node kinds. And we can later try to introduce classes into
each node kind.
Regarding size class decoupling, I'll respond to a point made earlier:
On Fri, Sep 30, 2022 at 10:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
With this idea, we can just repalloc() to grow to the larger size in a
pair but I'm slightly concerned that the more size class we use, the
more frequent the node needs to grow.
Well, yes, but that's orthogonal. For example, v6 has 5 node kinds. Imagine that we have 4 node kinds, but the SIMD node kind used 2 size classes. Then the nodes would grow at *exactly* the same frequency as they do today. I listed many ways a size class could fit into a power-of-two (and there are more), but we have a choice in how many to actually use. It's a trade off between memory usage and complexity.
Agreed.
Regards,
--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Oct 14, 2022 at 4:12 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Hi,
On Mon, Oct 10, 2022 at 2:16 PM John Naylor
<john.naylor@enterprisedb.com> wrote:The following is not quite a full review, but has plenty to think about. There is too much to cover at once, and I have to start somewhere...
My main concerns are that internal APIs:
1. are difficult to follow
2. lead to poor branch prediction and too many function callsSome of the measurements are picking on the SIMD search code, but I go into details in order to demonstrate how a regression there can go completely unnoticed. Hopefully the broader themes are informative.
On Fri, Oct 7, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
[fixed benchmarks]
Thanks for that! Now I can show clear results on some aspects in a simple way. The attached patches (apply on top of v6) are not intended to be incorporated as-is quite yet, but do point the way to some reorganization that I think is necessary. I've done some testing on loading, but will leave it out for now in the interest of length.
0001-0003 are your performance test fix and and some small conveniences for testing. Binary search is turned off, for example, because we know it already. And the sleep call is so I can run perf in a different shell session, on only the search portion.
Note the v6 test loads all block numbers in the range. Since the test item ids are all below 64 (reasonable), there are always 32 leaf chunks, so all the leaves are node32 and completely full. This had the effect of never taking the byte-wise loop in the proposed pg_lsearch function. These two aspects make this an easy case for the branch predictor:
john=# select * from bench_seq_search(0, 1*1000*1000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 167 | 0 | 822 | 01,470,141,841 branches:u
63,693 branch-misses:u # 0.00% of all branchesjohn=# select * from bench_shuffle_search(0, 1*1000*1000);
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 168 | 0 | 2174 | 01,470,142,569 branches:u
15,023,983 branch-misses:u # 1.02% of all branches0004 randomizes block selection in the load part of the search test so that each block has a 50% chance of being loaded. Note that now we have many node16s where we had none before. Although node 16 and node32 appear to share the same path in the switch statement of rt_node_search(), the chunk comparison and node_get_values() calls each must go through different branches. The shuffle case is most affected, but even the sequential case slows down. (The leaves are less full -> there are more of them, so memory use is larger, but it shouldn't matter much, in the sequential case at least)
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 173 | 0 | 907 | 01,684,114,926 branches:u
1,989,901 branch-misses:u # 0.12% of all branchesjohn=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 173 | 0 | 2890 | 01,684,115,844 branches:u
34,215,740 branch-misses:u # 2.03% of all branches0005 replaces pg_lsearch with a branch-free SIMD search. Note that it retains full portability and gains predictable performance. For demonstration, it's used on all three linear-search types. Although I'm sure it'd be way too slow for node4, this benchmark hardly has any so it's ok.
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 176 | 0 | 867 | 01,469,540,357 branches:u
96,678 branch-misses:u # 0.01% of all branchesjohn=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 171 | 0 | 2530 | 01,469,540,533 branches:u
15,019,975 branch-misses:u # 1.02% of all branches0006 removes node16, and 0007 avoids a function call to introspect node type. 0006 is really to make 0007 simpler to code. The crucial point here is that calling out to rt_node_get_values/children() to figure out what type we are is costly. With these patches, searching an unevenly populated load is the same or faster than the original sequential load, despite taking twice as much memory. (And, as I've noted before, decoupling size class from node kind would win the memory back.)
john=# select * from bench_seq_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 171 | 0 | 717 | 01,349,614,294 branches:u
1,313 branch-misses:u # 0.00% of all branchesjohn=# select * from bench_shuffle_search(0, 2*1000*1000);
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 172 | 0 | 2202 | 01,349,614,741 branches:u
30,592 branch-misses:u # 0.00% of all branchesExpanding this point, once a path branches based on node kind, there should be no reason to ever forget the kind. Ther abstractions in v6 have disadvantages. I understand the reasoning -- to reduce duplication of code. However, done this way, less code in the text editor leads to *more* code (i.e. costly function calls and branches) on the machine level.
Right. When updating the patch from v4 to v5, I've eliminated the
duplication of code between each node type as much as possible, which
in turn produced more code on the machine level. The resulst of your
experiment clearly showed the bad side of this work. FWIW I've also
confirmed your changes in my environment (I've added the third
argument to turn on and off the randomizes block selection proposed in
0004 patch):* w/o patches
postgres(1:361692)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:33:15.460 JST [361692] LOG: num_keys = 1000000, height
= 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 87 |
| 462 |
(1 row)1590104944 branches:u # 3.430 G/sec
65957 branch-misses:u # 0.00% of all branchespostgres(1:361692)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:33:28.934 JST [361692] LOG: num_keys = 999654, height =
2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 91 |
| 497 |
(1 row)1748249456 branches:u # 3.506 G/sec
481074 branch-misses:u # 0.03% of all branchespostgres(1:361692)=# select * from bench_shuffle_search(0, 1 * 1000 *
1000, false);
2022-10-14 11:33:38.378 JST [361692] LOG: num_keys = 1000000, height
= 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 86 |
| 1290 |
(1 row)1590105370 branches:u # 1.231 G/sec
15039443 branch-misses:u # 0.95% of all branchesTime: 4166.346 ms (00:04.166)
postgres(1:361692)=# select * from bench_shuffle_search(0, 2 * 1000 *
1000, true);
2022-10-14 11:33:51.556 JST [361692] LOG: num_keys = 999654, height =
2, n4 = 1, n16 = 35610, n32 = 26889, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 14893056 | 179937720 | 90 |
| 1536 |
(1 row)1748250497 branches:u # 1.137 G/sec
28125016 branch-misses:u # 1.61% of all branches* w/ all patches
postgres(1:360358)=# select * from bench_seq_search(0, 1 * 1000 * 1000, false);
2022-10-14 11:29:27.232 JST [360358] LOG: num_keys = 1000000, height
= 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 81 |
| 432 |
(1 row)1380062209 branches:u # 3.185 G/sec
1066 branch-misses:u # 0.00% of all branchespostgres(1:360358)=# select * from bench_seq_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:29:46.380 JST [360358] LOG: num_keys = 999654, height =
2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 88 |
| 438 |
(1 row)1379640815 branches:u # 3.133 G/sec
1332 branch-misses:u # 0.00% of all branchespostgres(1:360358)=# select * from bench_shuffle_search(0, 1 * 1000 *
1000, false);
2022-10-14 11:30:00.943 JST [360358] LOG: num_keys = 1000000, height
= 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 10199040 | 180000000 | 81 |
| 994 |
(1 row)

1380062386 branches:u # 1.386 G/sec
18368 branch-misses:u # 0.00% of all branches

postgres(1:360358)=# select * from bench_shuffle_search(0, 2 * 1000 * 1000, true);
2022-10-14 11:30:15.944 JST [360358] LOG: num_keys = 999654, height =
2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
NOTICE: sleeping for 2 seconds...
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 20381696 | 179937720 | 88 |
| 1098 |
(1 row)

1379641503 branches:u # 1.254 G/sec
18973 branch-misses:u # 0.00% of all branches

I haven't looked at insert/load performance carefully, but it's clear it suffers from the same amnesia. prepare_node_for_insert() branches based on the kind. If it must call rt_node_grow(), that function has no idea where it came from and must branch again. When prepare_node_for_insert() returns, we again have no idea what the kind is, so we must branch again. And if we are one of the three linear-search nodes, we later do another function call, where we encounter a 5-way jump table because the caller could be anything at all.
Some of this could be worked around with always-inline functions to which we pass a const node kind, and let the compiler get rid of the branches etc. But many cases are probably not even worth doing that. For example, I don't think prepare_node_for_insert() is a useful abstraction to begin with. It returns an index, but only for linear nodes. Lookup nodes get a return value of zero. There is not enough commonality here.
Agreed.
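To make the idea concrete, here is a rough sketch of the always-inline direction (an illustration only, building on the definitions in the attached radixtree.c; node_search_eq_internal() and node_4_search_eq_fast() are hypothetical names, not functions in the patch):

/*
 * Hypothetical sketch: the helper takes the node kind as a constant, so each
 * caller that passes a literal kind gets a specialized copy in which the
 * compiler can fold the switch away. Forcing inlining (e.g. with an
 * always-inline attribute) would make that more reliable.
 */
static inline int
node_search_eq_internal(rt_node *node, uint8 chunk, const int kind)
{
	switch (kind)
	{
		case RT_NODE_KIND_4:
			return node_4_search_eq((rt_node_base_4 *) node, chunk);
		case RT_NODE_KIND_32:
			return node_32_search_eq((rt_node_base_32 *) node, chunk);
		default:
			return -1;	/* lookup-table node kinds handled elsewhere */
	}
}

/* A call site that already knows the kind pays no dispatch cost. */
static inline int
node_4_search_eq_fast(rt_node *node, uint8 chunk)
{
	return node_search_eq_internal(node, chunk, RT_NODE_KIND_4);
}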
Along the same lines, there are a number of places that have branches as a consequence of treating inner nodes and leaves with the same api:
rt_node_iterate_next
chunk_array_node_get_slot
node_128/256_get_slot
rt_node_search

I'm leaning towards splitting these out into specialized functions for each inner and leaf. This is a bit painful for the last one, but perhaps if we are resigned to templating the shared-mem case, maybe we can template some of the inner/leaf stuff. Something to think about for later, but for now I believe we have to accept some code duplication as a prerequisite for decent performance as well as readability.
Agreed.
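For illustration, the kind of split I have in mind for the iteration path is roughly the following (a sketch only; the "generic" variant shown first does not exist in the attached patch, it just stands in for the combined treatment being criticized):

/* Combined form: every step pays a branch on whether the node is a leaf. */
static bool
rt_node_iterate_next_generic(rt_iter *iter, rt_node_iter *node_iter,
							 rt_node **child_p, uint64 *value_p)
{
	if (NODE_IS_LEAF(node_iter->node))
		return rt_node_leaf_iterate_next(iter, node_iter, value_p);

	*child_p = rt_node_inner_iterate_next(iter, node_iter);
	return (*child_p != NULL);
}

In the split form, the iterator keeps the leaf node at stack[0] and inner nodes above it, so rt_iterate_next() calls the leaf variant only for level 0 and the inner variant for higher levels, and the leaf-vs-inner test disappears from the per-step path.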
For the next steps, we need to proceed cautiously because there is a lot in the air at the moment. Here are some aspects I would find desirable. If there are impracticalities I haven't thought of, we can discuss further. I don't pretend to know the practical consequences of every change I mention.
- If you have started coding the shared memory case, I'd advise to continue so we can see what that looks like. If that has not gotten beyond the design stage, I'd like to first see an attempt at tearing down some of the clumsier abstractions in the current patch.
- As a "smoke test", there should ideally be nothing as general as rt_node_get_children/values(). We should ideally always know what kind we are if we found out earlier.
- For distinguishing between linear nodes, perhaps some always-inline functions can help hide details. But at the same time, trying to treat them the same is not always worthwhile.
- Start to separate treatment of inner/leaves and see how it goes.

Since I've not started coding the shared memory case seriously, I'm
going to start with eliminating abstractions and splitting the
treatment of inner and leaf nodes.
I've attached updated PoC patches for discussion and cfbot. From the
previous version, I mainly changed the following things:
* Separate treatment of inner and leaf nodes
* Pack both the node kind and the node count into a uint16 value (see the sketch below).
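For reference, the packing looks roughly like this (the NODE_GET_COUNT()/NODE_GET_KIND() macros in the attached patch are the authoritative definitions; the helper functions below are only an illustrative sketch):

/*
 * uint16 'info' layout in this version:
 *   bits 0-8  : number of children (a node-256 can hold up to 256)
 *   bits 9-10 : node kind (currently 4 kinds)
 */
static inline uint16
info_pack(uint8 kind, uint16 count)
{
	return (uint16) ((kind << RT_NODE_INFO_COUNT_BITS) | (count & RT_NODE_INFO_COUNT_MASK));
}

static inline uint8
info_get_kind(uint16 info)
{
	return (uint8) ((info >> RT_NODE_INFO_COUNT_BITS) & RT_NODE_INFO_KIND_MASK);
}

static inline uint16
info_get_count(uint16 info)
{
	return info & RT_NODE_INFO_COUNT_MASK;
}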
I've also made a change to the functions in the bench_radix_tree test
module: the third argument of bench_seq/shuffle_search() is a flag to
turn randomized block selection on and off. The results of performance
tests in my environment are:
postgres(1:1665989)=# select * from bench_seq_search(0, 1* 1000 * 1000, false);
2022-10-24 14:29:40.705 JST [1665989] LOG: num_keys = 1000000, height
= 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 9871104 | 180000000 | 65 |
| 248 |
(1 row)
postgres(1:1665989)=# select * from bench_seq_search(0, 2* 1000 * 1000, true);
2022-10-24 14:29:47.999 JST [1665989] LOG: num_keys = 999654, height
= 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 19680736 | 179937720 | 71 |
| 237 |
(1 row)
postgres(1:1665989)=# select * from bench_shuffle_search(0, 1 * 1000 *
1000, false);
2022-10-24 14:29:55.955 JST [1665989] LOG: num_keys = 1000000, height
= 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
1000000 | 9871104 | 180000000 | 65 |
| 641 |
(1 row)
postgres(1:1665989)=# select * from bench_shuffle_search(0, 2 * 1000 *
1000, true);
2022-10-24 14:30:04.140 JST [1665989] LOG: num_keys = 999654, height
= 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms |
array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
999654 | 19680736 | 179937720 | 71 |
| 654 |
(1 row)
I've not worked on the SIMD part seriously yet, but overall the
performance seems good so far. If we agree with the current approach,
I think we can proceed with verifying the decoupling of node sizes
from node kinds, and I'll investigate DSA support.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From fcf76629b46732b56e424111f3fb8b53c05fd07a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [POC PATCH 1/3] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 62 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 62 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..039d7e5235 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -60,6 +60,15 @@ typedef uint32x4_t Vector32;
typedef uint64 Vector8;
#endif
+/*
+ * Some of the functions with SIMD implementations use bitwise operations
+ * available in pg_bitutils.h. There are currently no non-SIMD implementations
+ * that require these bitwise operations.
+ */
+#ifndef USE_NO_SIMD
+#include "port/pg_bitutils.h"
+#endif
+
/* load/store operations */
static inline void vector8_load(Vector8 *v, const uint8 *s);
#ifndef USE_NO_SIMD
@@ -79,6 +88,8 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline int vector8_find(const Vector8 v, const uint8 c);
+static inline int vector8_find_ge(const Vector8 v, const uint8 c);
#endif
/* arithmetic operations */
@@ -262,6 +273,27 @@ vector8_has_le(const Vector8 v, const uint8 c)
return result;
}
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
+
+ return r;
+#endif
+}
+
/*
* Return true if the high bit of any element is set
*/
@@ -277,6 +309,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
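+ /*
+ * Collect the high bit of each byte into a 16-bit mask: the arithmetic
+ * shift turns each byte into 0xFF or 0x00, the AND keeps one distinct bit
+ * per lane, and interleaving the low and high halves lets the horizontal
+ * add of uint16 lanes OR the disjoint bits together.
+ */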
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
--
2.31.1
0002-Add-radix-implementation.patch
From 6cd239b14d521f2f1377730874c27b4eb9281217 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [POC PATCH 2/3] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/radixtree.c | 2439 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 28 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 504 ++++
.../test_radixtree/test_radixtree.control | 4 +
12 files changed, 3068 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..93c81b843f
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2439 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * this radix tree module utilizes AVX2 instructions, enabling us to use 256-bit
+ * wide SIMD vectors, whereas 128-bit wide SIMD vectors are used in the paper.
+ * Also, there is no support for path compression and lazy path expansion. The
+ * radix tree supports only fixed-length keys, so we don't expect the tree to
+ * become very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner tree nodes,
+ * shift > 0, store the pointer to their child node as the value. The leaf nodes,
+ * shift == 0, have the 64-bit unsigned integer that is specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants, for inner nodes and leaf
+ * nodes, and therefore there is duplicated code. While this sometimes makes
+ * code maintenance tricky, it reduces branch prediction misses when judging
+ * whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * and creates memory contexts for all kinds of radix tree nodes under it.
+ *
+ * rt_iterate_next() returns key-value pairs in the ascending
+ * order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes required for a bitmap covering nslots slots,
+ * used by nodes whose slots are indexed by array lookup.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-128 */
+#define RT_NODE_128_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from the value to the bit in is-set bitmap in the node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/* Base type for all nodes types */
+typedef struct rt_node
+{
+ /* The number of children and the node kind */
+ uint16 info;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+} rt_node;
+
+/*
+ * Flags and masks for 'info'.
+ *
+ * The lowest 9 bits of 'info' represent the number of children in the node, and
+ * the next 2 bits are node kind.
+ */
+#define RT_NODE_INFO_COUNT_BITS 9
+#define RT_NODE_INFO_KIND_BITS 2
+#define RT_NODE_INFO_COUNT_MASK ((1 << RT_NODE_INFO_COUNT_BITS) - 1)
+#define RT_NODE_INFO_KIND_MASK ((1 << RT_NODE_INFO_KIND_BITS) - 1)
+
+/*
+ * Supported radix tree node kinds.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation,
+ * a smaller class should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 40/40 -> 296/286 -> 1288/1304 -> 2056/2088 bytes for inner nodes and
+ * leaf nodes, respectively, leading to a large amount of allocator padding
+ * with aset.c. Hence the use of slab.
+ *
+ * XXX: need to have node-1 until there is no path compression optimization?
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_128 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/* Macros to access the count and the kind in 'info' */
+#define NODE_GET_COUNT(n) (((rt_node *) (n))->info & RT_NODE_INFO_COUNT_MASK)
+#define NODE_GET_KIND(n) \
+ (((((rt_node* ) (n))->info) >> RT_NODE_INFO_COUNT_BITS) & RT_NODE_INFO_KIND_MASK)
+#define NODE_INCREMENT_COUNT(n) \
+ do { \
+ ((rt_node *) (n))->info++; \
+ Assert(NODE_GET_COUNT(n) <= rt_node_kind_info[NODE_GET_KIND(n)].fanout); \
+ } while (0)
+#define NODE_DECREMENT_COUNT(n) \
+ do { \
+ ((rt_node *) (n))->info--; \
+ Assert(NODE_GET_COUNT(n) >= 0); \
+ } while (0)
+#define NODE_SET_COUNT(n, count) \
+ do { \
+ ((rt_node *) (n))->info &= ~RT_NODE_INFO_COUNT_MASK; \
+ ((rt_node *) (n))->info |= (count); \
+ } while (0)
+#define NODE_SET_KIND(n, kind) \
+ do { \
+ ((rt_node *) (n))->info &= ~(RT_NODE_INFO_KIND_MASK << RT_NODE_INFO_COUNT_BITS); \
+ ((rt_node *) (n))->info |= ((kind) << RT_NODE_INFO_COUNT_BITS); \
+ } while (0)
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (NODE_GET_COUNT(((rt_node *) (n))) == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (NODE_GET_COUNT(n) < rt_node_kind_info[NODE_GET_KIND(n)].fanout)
+
+/* Base type of each node kinds for leaf and inner nodes */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-128 uses a slot_idxs array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 128 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base128
+{
+ rt_node n;
+
+ /* Index into the children/values array for each chunk */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_128;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * Leaf nodes are separate from inner node size classes for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* 4 children, for key chunks */
+ rt_node *children[4];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* 4 values, for key chunks */
+ uint64 values[4];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* 32 children, for key chunks */
+ rt_node *children[32];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* 32 values, for key chunks */
+ uint64 values[32];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 children */
+ rt_node *children[128];
+} rt_node_inner_128;
+
+typedef struct rt_node_leaf_128
+{
+ rt_node_base_128 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* Slots for 128 values */
+ uint64 values[128];
+} rt_node_leaf_128;
+
+/*
+ * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each node kind */
+typedef struct rt_node_kind_info_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_node_kind_info_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
+static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
+
+ [RT_NODE_KIND_4] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4),
+ .leaf_size = sizeof(rt_node_leaf_4),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
+ },
+ [RT_NODE_KIND_32] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32),
+ .leaf_size = sizeof(rt_node_leaf_32),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
+ },
+ [RT_NODE_KIND_128] = {
+ .name = "radix tree node 128",
+ .fanout = 128,
+ .inner_size = sizeof(rt_node_inner_128),
+ .leaf_size = sizeof(rt_node_leaf_128),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
+ },
+ [RT_NODE_KIND_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating over the radix tree returns each pair of key and value in the
+ * ascending order of the key. To support this, we iterate over nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_NODE_KIND_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static rt_node *rt_node_add_new_child(radix_tree *tree, rt_node *parent,
+ rt_node *node, uint64 key);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, int from);
+static void rt_update_node_iter(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node *node);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'.
+ * Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < NODE_GET_COUNT(node); i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the given node that is greater
+ * than or equal to 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_ge(rt_node_base_4 * node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < NODE_GET_COUNT(node); i++)
+ {
+ if (node->chunks[i] >= chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'.
+ * Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = NODE_GET_COUNT(node);
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
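+ /*
+ * Compare 'chunk' against all 32 stored chunks with two 16-byte vector
+ * comparisons, gather the per-byte results into a 32-bit bitmap, mask off
+ * slots beyond 'count', and return the position of the lowest set bit.
+ */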
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ /* XXX: should not have to use vector8_highbit_mask */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index of the first chunk in the given node that is greater
+ * than or equal to 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_ge(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = NODE_GET_COUNT(node);
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] >= chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
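+ /*
+ * The vector API has no unsigned "greater than or equal" comparison, so
+ * compute the per-byte minimum of 'chunk' and the stored chunks: a stored
+ * chunk is >= 'chunk' exactly when that minimum equals 'chunk'.
+ */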
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values, int count)
+{
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_values, src_values, sizeof(uint64) * count);
+}
+
+/* Functions to manipulate inner and leaf node-128 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_128 *) node)->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Delete the chunk in the node */
+static void
+node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Delete the chunk in the node */
+static void
+node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+static int
+node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+/* Return an unused slot in node-128 */
+static int
+node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /*
+ * Find an unused slot. We iterate over the isset bitmap per byte then
+ * check each bit.
+ */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+/* Update the value corresponding to 'chunk' to 'value' */
+static inline void
+node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ rt_node *node;
+
+ node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
+ shift > 0);
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = node;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+ NODE_SET_KIND(newnode, kind);
+ newnode->shift = shift;
+ newnode->chunk = chunk;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+
+ memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+ }
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[kind]++;
+#endif
+
+ return newnode;
+}
+
+static rt_node *
+rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ NODE_SET_COUNT(newnode, NODE_GET_COUNT(node));
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ tree->root = NULL;
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[NODE_GET_KIND(node)]--;
+ Assert(tree->cnt[NODE_GET_KIND(node)] >= 0);
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
+ shift, 0, true);
+ NODE_SET_COUNT(node, 1);
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is returned in *child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (NODE_GET_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ NODE_GET_COUNT(n4), idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ NODE_GET_COUNT(n32), idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_128_get_child(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ NODE_DECREMENT_COUNT(node);
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the value
+ * is returned in *value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (NODE_GET_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ NODE_GET_COUNT(n4), idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ NODE_GET_COUNT(n32), idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_128_get_value(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ NODE_DECREMENT_COUNT(node);
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert a new child to 'node' */
+static rt_node *
+rt_node_add_new_child(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key)
+{
+ uint8 newshift = node->shift - RT_NODE_SPAN;
+ rt_node *newchild;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
+ RT_GET_KEY_CHUNK(key, node->shift),
+ newshift > 0);
+
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ return (rt_node *) newchild;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (NODE_GET_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_32 *new32;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (likely(NODE_HAS_FREE_SLOT(n4)))
+ {
+ int insertpos = node_4_search_ge((rt_node_base_4 *) n4, chunk);
+ uint16 count = NODE_GET_COUNT(n4);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children,
+ NODE_GET_COUNT(n4));
+
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_128 *new128;
+
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (likely(NODE_HAS_FREE_SLOT(n32)))
+ {
+ int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
+ int16 count = NODE_GET_COUNT(n32);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < NODE_GET_COUNT(n32); i++)
+ node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ rt_node_inner_256 *new256;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_128_update(n128, chunk, child);
+ break;
+ }
+
+ if (likely(NODE_HAS_FREE_SLOT(n128)))
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < NODE_GET_COUNT(n128); i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_128_get_child(n128, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ NODE_INCREMENT_COUNT(node);
+
+ /*
+ * Done. Finally, verify that the chunk and child were inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (NODE_GET_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_32 *new32;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (likely(NODE_HAS_FREE_SLOT(n4)))
+ {
+ int insertpos = node_4_search_ge((rt_node_base_4 *) n4, chunk);
+ int count = NODE_GET_COUNT(n4);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values,
+ NODE_GET_COUNT(n4));
+
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_128 *new128;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (likely(NODE_HAS_FREE_SLOT(n32)))
+ {
+ int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
+ int count = NODE_GET_COUNT(n32);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < NODE_GET_COUNT(n32); i++)
+ node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ rt_node_leaf_256 *new256;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_128_update(n128, chunk, value);
+ break;
+ }
+
+ if (likely(NODE_HAS_FREE_SLOT(n128)))
+ {
+ node_leaf_128_insert(n128, chunk, value);
+ break;
+ }
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < NODE_GET_COUNT(n128); i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_128_get_value(n128, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ NODE_INCREMENT_COUNT(node);
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to 'value'
+ * and return true. Returns false if entry doesn't yet exist.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ child = rt_node_add_new_child(tree, parent, node, key);
+
+ Assert(child);
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* arrived at a leaf */
+ Assert(NODE_IS_LEAF(node));
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, we set the value in *value_p, so it must
+ * not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* We reached a leaf node, so search the corresponding slot */
+ Assert(NODE_IS_LEAF(node));
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+ return false;
+
+ return true;
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int level;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes
+ * we visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = 0;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[level] = node;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* there is no key to delete */
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
+ return false;
+
+ /* Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Delete the key from the leaf node and recursively delete the key in
+ * inner nodes if necessary.
+ */
+ Assert(NODE_IS_LEAF(stack[level]));
+ while (level >= 0)
+ {
+ rt_node *node = stack[level--];
+
+ if (NODE_IS_LEAF(node))
+ rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+ else
+ rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ /*
+ * If we eventually deleted the root node while recursively deleting empty
+ * nodes, we make the tree empty.
+ */
+ if (level == 0)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+
+ iter->stack_len = top_level;
+ iter->stack[top_level].node = iter->tree->root;
+ iter->stack[top_level].current_idx = -1;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update the stack of radix tree nodes while descending to the leaf from
+ * the 'from' level.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, int from)
+{
+ rt_node *node = iter->stack[from].node;
+ int level = from;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ /* Set the node to this level */
+ rt_update_node_iter(iter, node_iter, node);
+
+ /* Finish if we reached the leaf node */
+ if (NODE_IS_LEAF(node))
+ break;
+
+ /* Advance to the next slot in the node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /*
+ * Since we always get the first slot in the node, we must have found
+ * the slot.
+ */
+ Assert(node);
+ }
+}
+
+/*
+ * If there is a next key, return true and set *key_p and *value_p.
+ * Otherwise, return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter;
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * Advance the inner nodes level by level, starting from the level-1
+ * inner node, until we find a node that has a next slot.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* We could not find any new key-value pair, so the iteration is finished */
+ if (!child)
+ return false;
+
+ /*
+ * We have advanced the slot in more than one node, including both the
+ * leaf node and inner nodes. So update the stack by descending to the
+ * leftmost leaf node from this level.
+ */
+ node_iter = &(iter->stack[level - 1]);
+ rt_update_node_iter(iter, node_iter, child);
+ rt_update_iter_stack(iter, level - 1);
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (NODE_GET_KIND(node_iter->node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= NODE_GET_COUNT(n4))
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= NODE_GET_COUNT(n32))
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= 256)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_128_get_child(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= 256)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value to *value_p; otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (NODE_GET_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= NODE_GET_COUNT(n4))
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= NODE_GET_COUNT(n32))
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= 256)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_128_get_value(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+ for (i = node_iter->current_idx + 1; i < 256; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= 256)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Set the node to node_iter so we can begin iterating over the node. Also,
+ * update the part of the key with the chunk of the given node.
+ */
+static void
+rt_update_node_iter(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node *node)
+{
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ rt_iter_update_key(iter, node->chunk, node->shift + RT_NODE_SPAN);
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = 0;
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(NODE_GET_COUNT(node) >= 0);
+
+ switch (NODE_GET_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < NODE_GET_COUNT(n4); i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < NODE_GET_COUNT(n32); i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) node,
+ n128->slot_idxs[i]));
+ else
+ Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) node,
+ n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(NODE_GET_COUNT(n128) == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check if the number of used chunks matches */
+ Assert(NODE_GET_COUNT(n256) == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0],
+ tree->cnt[1],
+ tree->cnt[2],
+ tree->cnt[3])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[128] = {0};
+
+ fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (NODE_GET_KIND(node) == RT_NODE_KIND_4) ? 4 :
+ (NODE_GET_KIND(node) == RT_NODE_KIND_32) ? 32 :
+ (NODE_GET_KIND(node) == RT_NODE_KIND_128) ? 128 : 256,
+ NODE_GET_COUNT(node), node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (NODE_GET_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < NODE_GET_COUNT(node); i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < NODE_GET_COUNT(node); i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *b128 = (rt_node_base_128 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b128->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < 256; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_128_get_value(n128, i));
+ }
+ else
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_128_get_child(n128, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < 256; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = %lu\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 7b3f292965..e587cabe13 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -26,6 +26,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..a4aa80a99c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,504 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int rt_node_max_entries[] = {
+ 4, /* RT_NODE_KIND_4 */
+ 16, /* RT_NODE_KIND_16 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ uint64 dummy;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(rt_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values are
+ * stored properly.
+ */
+ if (i == (rt_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : rt_node_max_entries[j - 1],
+ rt_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned" UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search
+ * entries again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the
+ * stats from the memory context. They should be in the same ballpark,
+ * but it's hard to automate testing that, so if you're making changes to
+ * the implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
Attachment: 0003-tool-for-measuring-radix-tree-performance.patch (application/x-patch)
From 726959296d734784292a46e5a01c95a276820db0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [POC PATCH 3/3] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 56 +++
contrib/bench_radix_tree/bench_radix_tree.c | 447 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 559 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..0874201d7e
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,56 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..673f96c860
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,447 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper-lower)+0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time, end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms, rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint64 key, val;
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms, ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time, end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time, end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms, rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r, h, i, j, k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ /* lower nodes have limited fanout, the top is only limited by bits-per-byte */
+ for (r=1;;r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+ key = (r<<32) | (h<<24) | (i<<16) | (j<<8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r=1;;r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key, val;
+ key = (r<<32) | (h<<24) | (i<<16) | (j<<8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I've attached updated PoC patches for discussion and cfbot. From the
previous version, I mainly changed the following things:

* Separate treatment of inner and leaf nodes
Overall, this looks much better!
* Pack both the node kind and node count to an uint16 value.
For this, I did mention a bitfield earlier as something we "could" do, but
it wasn't clear we should. After looking again at the node types, I must
not have thought through this at all. Storing one byte instead of four for
the full enum is a good step, but saving one more byte usually doesn't buy
anything because of padding, with a few exceptions like this example:
node4: 4 + 4 + 4*8 = 40
node4: 5 + 4+(7) + 4*8 = 48 bytes
Even there, I'd rather not spend the extra cycles to access the members.
And with my idea of decoupling size classes from kind, the variable-sized
kinds will require another byte to store "capacity". Then, even if the kind
gets encoded in a pointer tag, we'll still have 5 bytes in the base type.
So I think we should assume 5 bytes from the start. (Might be 6 temporarily
if I work on size decoupling first).
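
For illustration, the padding plays out roughly like this (the field names and layout below are made up for the example, not the patch's actual definitions):

typedef struct rt_node rt_node;	/* only the pointer size matters here */

/* ~40 bytes: 4 bytes of metadata + 4 chunk bytes + 4 * 8-byte pointers */
typedef struct node4_packed
{
	uint16		kind_and_count;	/* kind and count packed into one uint16 */
	uint8		shift;
	uint8		chunk;
	uint8		chunks[4];
	rt_node    *children[4];	/* starts at offset 8, no padding needed */
} node4_packed;

/* ~48 bytes: one more metadata byte forces 7 bytes of padding */
typedef struct node4_unpacked
{
	uint8		kind;
	uint8		count;
	uint8		shift;
	uint8		chunk;
	uint8		capacity;		/* e.g. a size-class byte */
	uint8		chunks[4];		/* 9 bytes so far; padded up to 16 */
	rt_node    *children[4];
} node4_unpacked;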
(Side note, if you have occasion to use bitfields again in the future, C99
has syntactic support for them, so no need to write your own
shifting/masking code).
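
For instance, a sketch of what that could look like (purely illustrative; note that a uint16 bit-field is technically implementation-defined, so the exact base type is a judgment call):

typedef struct node_meta
{
	uint16		kind:3,			/* node kind */
				count:13;		/* number of entries */
} node_meta;

/* meta.kind and meta.count can then be read and assigned directly,
 * and the compiler emits the shift/mask code for us. */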
I've not done SIMD part seriously yet. But overall the performance
seems good so far. If we agree with the current approach, I think we
can proceed with the verification of decoupling node sizes from node
kind. And I'll investigate DSA support.
Sounds good. I have some additional comments about v7, and after these are
addressed, we can proceed independently with the above two items. Seeing
the DSA work will also inform me how invasive pointer tagging will be.
There will still be some performance tuning and cosmetic work, but it's
getting closer.
-------------------------
0001:
+#ifndef USE_NO_SIMD
+#include "port/pg_bitutils.h"
+#endif
Leftover from an earlier version?
+static inline int vector8_find(const Vector8 v, const uint8 c);
+static inline int vector8_find_ge(const Vector8 v, const uint8 c);
Leftovers, causing compiler warnings. (Also see new variable shadow warning)
+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
+
+ return r;
+#endif
As I mentioned a couple versions ago, this style is really awkward, and
potential non-SIMD callers will be better off writing their own byte-wise
loop rather than using this API. Especially since the "min" function exists
only as a workaround for lack of unsigned comparison in (at least) SSE2.
There is one existing function in this file with that idiom for non-assert
code (for completeness), but even there, inputs of current interest to us
use the uint64 algorithm.
0002:
+ /* XXX: should not to use vector8_highbit_mask */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
Hmm?
+/*
+ * Return index of the first element in chunks in the given node that is greater
+ * than or equal to 'key'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_ge(rt_node_base_32 *node, uint8 chunk)
The caller must now have logic for inserting at the end:
+ int insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
+ int16 count = NODE_GET_COUNT(n32);
+
+ if (insertpos < 0)
+ insertpos = count; /* insert to the tail */
It would be a bit more clear if node_*_search_ge() always returns the
position we need (see the prototype for example). In fact, these functions
are probably better named node*_get_insertpos().
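
A sketch of that shape (the linear scan here is only for illustration; the real function would of course keep whatever search strategy it already has):

/*
 * Return the index of the first chunk >= 'chunk', or the current count if
 * there is no such element, i.e. always the position to insert at.
 */
static inline int
node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
{
	int			count = NODE_GET_COUNT(node);

	for (int i = 0; i < count; i++)
	{
		if (node->chunks[i] >= chunk)
			return i;
	}

	return count;				/* insert at the tail */
}

Then the caller can use the return value directly, without the "insertpos < 0" special case.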
+ if (likely(NODE_HAS_FREE_SLOT(n128)))
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+
+ /* grow node from 128 to 256 */
We want all the node-growing code to be pushed down to the bottom so that
all branches of the hot path are close together. This provides better
locality for the CPU frontend. Looking at the assembly, the above doesn't
have the desired effect, so we need to write like this (also see prototype):
if (unlikely( ! has-free-slot))
grow-node;
else
{
...;
break;
}
/* FALLTHROUGH */
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ child = rt_node_add_new_child(tree, parent, node, key);
+
+ Assert(child);
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
Note that if we have to call rt_node_add_new_child(), each successive loop
iteration must search it and find nothing there (the prototype had a
separate function to handle this). Maybe it's not that critical yet, but
something to keep in mind as we proceed. Maybe a comment about it to remind
us.
+ /* there is no key to delete */
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
+ return false;
+
+ /* Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Delete the key from the leaf node and recursively delete the key in
+ * inner nodes if necessary.
+ */
+ Assert(NODE_IS_LEAF(stack[level]));
+ while (level >= 0)
+ {
+ rt_node *node = stack[level--];
+
+ if (NODE_IS_LEAF(node))
+ rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+ else
+ rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
Here we call rt_node_search_leaf() twice -- once to check for existence,
and once to delete. All three search calls are inlined, so this wastes
space. Let's try to delete the leaf, return if not found, otherwise handle
the leaf bookkeeping and loop over the inner nodes. This might require
some duplication of code.
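
Something along these lines, perhaps (just a sketch; it assumes the RT_ACTION_DELETE search functions are changed to report whether the key was actually present):

	/* Delete the key from the leaf node; if it wasn't there, we're done */
	Assert(NODE_IS_LEAF(stack[level]));
	if (!rt_node_search_leaf(stack[level], key, RT_ACTION_DELETE, NULL))
		return false;

	/* Update the statistics */
	tree->num_keys--;

	/* Free the leaf if it became empty, then unwind the inner nodes */
	while (level >= 0 && NODE_IS_EMPTY(stack[level]))
	{
		rt_free_node(tree, stack[level]);

		if (--level >= 0)
			rt_node_search_inner(stack[level], key, RT_ACTION_DELETE, NULL);
	}

	/* If even the root was freed, the tree is now empty */
	if (level < 0)
	{
		tree->root = NULL;
		tree->max_val = 0;
	}

	return true;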
+ndoe_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
Spelling
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
gcc generates better code with something like this (but not hard-coded) at
the top:
if (count > 4)
pg_unreachable();
This would have to change when we implement shrinking of nodes, but might
still be useful.
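
Putting those together, i.e. something like this (with the constant hard-coded only for illustration, as noted above):

static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
						  uint8 *dst_chunks, rt_node **dst_children, int count)
{
	/* promise the compiler the copy is tiny so it can open-code the memcpy */
	if (count > 4)
		pg_unreachable();

	memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
	memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}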
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+ return false;
+
+ return true;
Maybe just "return rt_node_search_leaf(...)" ?
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Oct 26, 2022 at 8:06 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Oct 24, 2022 at 12:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached updated PoC patches for discussion and cfbot. From the
previous version, I mainly changed the following things:
Thank you for the comments!
* Separate treatment of inner and leaf nodes
Overall, this looks much better!
* Pack both the node kind and node count to an uint16 value.
For this, I did mention a bitfield earlier as something we "could" do, but it wasn't clear we should. After looking again at the node types, I must not have thought through this at all. Storing one byte instead of four for the full enum is a good step, but saving one more byte usually doesn't buy anything because of padding, with a few exceptions like this example:
node4: 4 + 4 + 4*8 = 40
node4: 5 + 4+(7) + 4*8 = 48 bytes

Even there, I'd rather not spend the extra cycles to access the members. And with my idea of decoupling size classes from kind, the variable-sized kinds will require another byte to store "capacity". Then, even if the kind gets encoded in a pointer tag, we'll still have 5 bytes in the base type. So I think we should assume 5 bytes from the start. (Might be 6 temporarily if I work on size decoupling first).
True. I'm going to start with 6 bytes and will consider reducing it to
5 bytes. Encoding the kind in a pointer tag could be tricky given DSA
support so currently I'm thinking to pack the node kind and node
capacity classes to uint8.
(Side note, if you have occasion to use bitfields again in the future, C99 has syntactic support for them, so no need to write your own shifting/masking code).
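For illustration only -- the field names here are made up, not from the patch -- the syntax looks like:

    typedef struct example_node_header
    {
        unsigned int    kind:3;     /* node type */
        unsigned int    count:13;   /* number of children */
    } example_node_header;

The compiler then generates the masking and shifting for accesses to those members.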
Thanks!
I've not done SIMD part seriously yet. But overall the performance
seems good so far. If we agree with the current approach, I think we
can proceed with the verification of decoupling node sizes from node
kind. And I'll investigate DSA support.
Sounds good. I have some additional comments about v7, and after these are addressed, we can proceed independently with the above two items. Seeing the DSA work will also inform me how invasive pointer tagging will be. There will still be some performance tuning and cosmetic work, but it's getting closer.
I've made some progress on investigating DSA support. I've written
draft patch for that and regression tests passed. I'll share it as a
separate patch for discussion with v8 radix tree patch.
While implementing DSA support, I realized that we may not need to use
pointer tagging to distinguish between backend-local address or
dsa_pointer. In order to get a backend-local address from dsa_pointer,
we need to pass dsa_area like:
node = dsa_get_address(tree->dsa, node_dp);
As shown above, the dsa area used by the shared radix tree is stored
in radix_tree struct, so we can know whether the radix tree is shared
or not by checking (tree->dsa == NULL). That is, if it's shared we use
a pointer to radix tree node as dsa_pointer, and if not we use a
pointer as a backend-local pointer. We don't need to encode something
in a pointer.
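So every access to a child can go through a small helper that checks tree->dsa, along these lines (a sketch of what the draft patch does):

    /* Resolve a child pointer to a backend-local address */
    static inline rt_node *
    node_ptr_get_local(radix_tree *tree, rt_node_ptr nodep)
    {
        if (tree->dsa != NULL)
            return (rt_node *) dsa_get_address(tree->dsa, (dsa_pointer) nodep);
        else
            return (rt_node *) nodep;
    }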
-------------------------
0001:
+#ifndef USE_NO_SIMD
+#include "port/pg_bitutils.h"
+#endif
Leftover from an earlier version?
+static inline int vector8_find(const Vector8 v, const uint8 c);
+static inline int vector8_find_ge(const Vector8 v, const uint8 c);
Leftovers, causing compiler warnings. (Also see new variable shadow warning)
Will fix.
+#else                          /* USE_NO_SIMD */
+    Vector8     r = 0;
+    uint8      *rp = (uint8 *) &r;
+
+    for (Size i = 0; i < sizeof(Vector8); i++)
+        rp[i] = Min(((const uint8 *) &v1)[i], ((const uint8 *) &v2)[i]);
+
+    return r;
+#endif
As I mentioned a couple versions ago, this style is really awkward, and potential non-SIMD callers will be better off writing their own byte-wise loop rather than using this API. Especially since the "min" function exists only as a workaround for lack of unsigned comparison in (at least) SSE2. There is one existing function in this file with that idiom for non-assert code (for completeness), but even there, inputs of current interest to us use the uint64 algorithm.
Agreed. Will remove non-SIMD code.
0002:
+    /* XXX: should not to use vector8_highbit_mask */
+    bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
Hmm?
It's my outdated memo, will remove.
+/*
+ * Return index of the first element in chunks in the given node that is greater
+ * than or equal to 'key'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_ge(rt_node_base_32 *node, uint8 chunk)
The caller must now have logic for inserting at the end:
+    int         insertpos = node_32_search_ge((rt_node_base_32 *) n32, chunk);
+    int16       count = NODE_GET_COUNT(n32);
+
+    if (insertpos < 0)
+        insertpos = count;      /* insert to the tail */
It would be a bit more clear if node_*_search_ge() always returns the position we need (see the prototype for example). In fact, these functions are probably better named node*_get_insertpos().
Agreed.
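It could then look something like this (a simplified, non-SIMD sketch; the field names are approximate):

    /* Return the position at which 'chunk' should be inserted */
    static inline int
    node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
    {
        int     count = node->n.count;

        for (int i = 0; i < count; i++)
        {
            if (node->chunks[i] >= chunk)
                return i;
        }

        return count;           /* no larger element, insert at the tail */
    }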
+    if (likely(NODE_HAS_FREE_SLOT(n128)))
+    {
+        node_inner_128_insert(n128, chunk, child);
+        break;
+    }
+
+    /* grow node from 128 to 256 */
We want all the node-growing code to be pushed down to the bottom so that all branches of the hot path are close together. This provides better locality for the CPU frontend. Looking at the assembly, the above doesn't have the desired effect, so we need to write like this (also see prototype):
if (unlikely( ! has-free-slot))
grow-node;
else
{
...;
break;
}
/* FALLTHROUGH */
Good point. Will change.
+    /* Descend the tree until a leaf node */
+    while (shift >= 0)
+    {
+        rt_node    *child;
+
+        if (NODE_IS_LEAF(node))
+            break;
+
+        if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+            child = rt_node_add_new_child(tree, parent, node, key);
+
+        Assert(child);
+
+        parent = node;
+        node = child;
+        shift -= RT_NODE_SPAN;
+    }
Note that if we have to call rt_node_add_new_child(), each successive loop iteration must search it and find nothing there (the prototype had a separate function to handle this). Maybe it's not that critical yet, but something to keep in mind as we proceed. Maybe a comment about it to remind us.
Agreed. Currently rt_extend() is used to add upper nodes but probably
we need another function to add lower nodes for this case.
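Something like the following could work for the non-shared case (a rough, uncompiled sketch; the function name is provisional):

    /* Create the chain of new nodes down to the leaf without searching them */
    static void
    rt_set_extend(radix_tree *tree, uint64 key, uint64 value,
                  rt_node *parent, rt_node *node)
    {
        int     shift = node->shift;

        while (shift >= RT_NODE_SPAN)
        {
            rt_node    *newchild;
            int         newshift = shift - RT_NODE_SPAN;

            newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
                                     RT_GET_KEY_CHUNK(key, node->shift),
                                     newshift > 0);
            rt_node_insert_inner(tree, parent, node, key, newchild);

            parent = node;
            node = newchild;
            shift -= RT_NODE_SPAN;
        }

        rt_node_insert_leaf(tree, parent, node, key, value);
        tree->num_keys++;
    }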
+    /* there is no key to delete */
+    if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
+        return false;
+
+    /* Update the statistics */
+    tree->num_keys--;
+
+    /*
+     * Delete the key from the leaf node and recursively delete the key in
+     * inner nodes if necessary.
+     */
+    Assert(NODE_IS_LEAF(stack[level]));
+    while (level >= 0)
+    {
+        rt_node    *node = stack[level--];
+
+        if (NODE_IS_LEAF(node))
+            rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+        else
+            rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+        /* If the node didn't become empty, we stop deleting the key */
+        if (!NODE_IS_EMPTY(node))
+            break;
+
+        /* The node became empty */
+        rt_free_node(tree, node);
+    }
Here we call rt_node_search_leaf() twice -- once to check for existence, and once to delete. All three search calls are inlined, so this wastes space. Let's try to delete the leaf, return if not found, otherwise handle the leaf bookkeeping and loop over the inner nodes. This might require some duplication of code.
Agreed.
+ndoe_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
Spelling
Will fix.
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+                          uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
gcc generates better code with something like this (but not hard-coded) at the top:
if (count > 4)
pg_unreachable();
Agreed.
This would have to change when we implement shrinking of nodes, but might still be useful.
+    if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+        return false;
+
+    return true;
Maybe just "return rt_node_search_leaf(...)" ?
Agreed.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Oct 27, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
True. I'm going to start with 6 bytes and will consider reducing it to
5 bytes.
Okay, let's plan on 6 for now, so we have the worst-case sizes up front. As
discussed, I will attempt the size class decoupling after v8 and see how it
goes.
Encoding the kind in a pointer tag could be tricky given DSA
If it turns out to be unworkable, that's life. If it's just tricky, that
can certainly be put off for future work. I hope to at least test it out
with local memory.
support so currently I'm thinking to pack the node kind and node
capacity classes to uint8.
That won't work, if we need 128 for capacity, leaving no bits left. I want
the capacity to be a number we can directly compare with the count (we
won't ever need to store 256 because that node will never grow). Also,
further to my last message, we need to access the kind quickly, without
more cycles.
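To illustrate what I mean (the layout and field names here are only an example, not a proposal for the exact struct):

    typedef struct rt_node
    {
        uint16      count;      /* number of children */
        uint8       kind;       /* read directly, no decoding needed */
        uint8       fanout;     /* capacity; compare directly against count */
        uint8       shift;
        uint8       chunk;
    } rt_node;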
I've made some progress on investigating DSA support. I've written
draft patch for that and regression tests passed. I'll share it as a
separate patch for discussion with v8 radix tree patch.
Great!
While implementing DSA support, I realized that we may not need to use
pointer tagging to distinguish between backend-local address or
dsa_pointer. In order to get a backend-local address from dsa_pointer,
we need to pass dsa_area like:
I was not clear -- when I see how much code changes to accommodate DSA
pointers, I imagine I will pretty much know the places that would be
affected by tagging the pointer with the node kind.
Speaking of tests, there is currently no Meson support, but tests pass
because this library is not used anywhere in the backend yet, and
apparently the CI Meson builds don't know to run the regression test? That
will need to be done too. However, it's okay to keep the benchmarking
module in autoconf, since it won't be committed.
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+                          uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
gcc generates better code with something like this (but not hard-coded)
at the top:
if (count > 4)
pg_unreachable();
Actually it just now occurred to me there's a bigger issue here: *We* know
this code can only get here iff count==4, so why doesn't the compiler know
that? I believe it boils down to
static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
In the assembly, I see it checks if there is room in the node by doing a
runtime lookup in this array, which is not constant. This might not be
important just yet, because I want to base the check on the proposed node
capacity instead, but I mention it as a reminder to us to make sure we take
all opportunities for the compiler to propagate constants.
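In other words, if the bound were a compile-time constant rather than a table lookup, the hint would actually help (hypothetical constant name, just to show the shape):

    #define RT_FANOUT_4     4   /* compile-time constant, not a table lookup */

        if (count > RT_FANOUT_4)
            pg_unreachable();   /* the compiler now knows count <= 4 here */

        memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
        memcpy(dst_children, src_children, sizeof(rt_node *) * count);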
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Oct 27, 2022 at 12:21 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Oct 27, 2022 at 9:11 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
True. I'm going to start with 6 bytes and will consider reducing it to
5 bytes.
Okay, let's plan on 6 for now, so we have the worst-case sizes up front. As discussed, I will attempt the size class decoupling after v8 and see how it goes.
Encoding the kind in a pointer tag could be tricky given DSA
If it turns out to be unworkable, that's life. If it's just tricky, that can certainly be put off for future work. I hope to at least test it out with local memory.
support so currently I'm thinking to pack the node kind and node
capacity classes to uint8.
That won't work, if we need 128 for capacity, leaving no bits left. I want the capacity to be a number we can directly compare with the count (we won't ever need to store 256 because that node will never grow). Also, further to my last message, we need to access the kind quickly, without more cycles.
Understood.
I've made some progress on investigating DSA support. I've written
draft patch for that and regression tests passed. I'll share it as a
separate patch for discussion with v8 radix tree patch.
Great!
While implementing DSA support, I realized that we may not need to use
pointer tagging to distinguish between backend-local address or
dsa_pointer. In order to get a backend-local address from dsa_pointer,
we need to pass dsa_area like:
I was not clear -- when I see how much code changes to accommodate DSA pointers, I imagine I will pretty much know the places that would be affected by tagging the pointer with the node kind.
Speaking of tests, there is currently no Meson support, but tests pass because this library is not used anywhere in the backend yet, and apparently the CI Meson builds don't know to run the regression test? That will need to be done too. However, it's okay to keep the benchmarking module in autoconf, since it won't be committed.
Updated to support Meson.
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+                          uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
gcc generates better code with something like this (but not hard-coded) at the top:
if (count > 4)
pg_unreachable();
Actually it just now occurred to me there's a bigger issue here: *We* know this code can only get here iff count==4, so why doesn't the compiler know that? I believe it boils down to
static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
In the assembly, I see it checks if there is room in the node by doing a runtime lookup in this array, which is not constant. This might not be important just yet, because I want to base the check on the proposed node capacity instead, but I mention it as a reminder to us to make sure we take all opportunities for the compiler to propagate constants.
I've attached the v8 patches. The 0001, 0002, and 0003 patches incorporate
the comments I got so far. The 0004 patch is a PoC patch for DSA support.
In the 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes
to point to their children, and we use rt_node_ptr as either rt_node * or
dsa_pointer depending on whether the radix tree is shared or not (i.e.,
by checking radix_tree->dsa == NULL). Regarding the performance, I've
added another boolean argument to bench_seq/shuffle_search(),
specifying whether to use the shared radix tree or not. Here are
benchmark results in my environment:
select * from bench_seq_search(0, 1* 1000 * 1000, false, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871240 |           180000000 |         67 |               |          241 |
(1 row)

select * from bench_seq_search(0, 1* 1000 * 1000, false, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         14680064 |           180000000 |         81 |               |          483 |
(1 row)

select * from bench_seq_search(0, 2* 1000 * 1000, true, false);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680872 |           179937720 |         74 |               |          235 |
(1 row)

select * from bench_seq_search(0, 2* 1000 * 1000, true, true);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         23068672 |           179937720 |         86 |               |          445 |
(1 row)

select * from bench_shuffle_search(0, 1* 1000 * 1000, false, false);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871240 |           180000000 |         67 |               |          640 |
(1 row)

select * from bench_shuffle_search(0, 1* 1000 * 1000, false, true);
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |         14680064 |           180000000 |         81 |               |         1002 |
(1 row)

select * from bench_shuffle_search(0, 2* 1000 * 1000, true, false);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680872 |           179937720 |         74 |               |          697 |
(1 row)

select * from bench_shuffle_search(0, 2* 1000 * 1000, true, true);
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         23068672 |           179937720 |         86 |               |         1030 |
(1 row)
In non-shared radix tree cases (the fourth argument is false), I don't
see a visible performance degradation. On the other hand, in shared
radix tree cases (the fourth argument is true), I see visible overheads
because of dsa_get_address().
Please note that the current shared radix tree implementation doesn't
support any locking, so it cannot be read while written by someone.
Also, only one process can iterate over the shared radix tree. When it
comes to parallel vacuum, these are not restrictions, as the leader
process writes the radix tree while scanning the heap, and the radix tree
is read by multiple processes while vacuuming indexes. And only the
leader process can do heap vacuum by iterating the key-value pairs in
the radix tree. If we want to use it for other cases too, we would
need to support locking, RCU or something.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v8-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From 8a240268c8135a871f80b8d465e0335745f2cedd Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v8 1/4] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmak of the high-bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
v8-0004-PoC-DSA-support-for-radix-tree.patch
From eac9256167afc948166144820e0d884c9e89f8cc Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 27 Oct 2022 14:02:00 +0900
Subject: [PATCH v8 4/4] PoC: DSA support for radix tree.
---
.../bench_radix_tree--1.0.sql | 2 +
contrib/bench_radix_tree/bench_radix_tree.c | 12 +-
src/backend/lib/radixtree.c | 683 ++++++++++++------
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 6 +-
src/include/utils/dsa.h | 1 +
.../expected/test_radixtree.out | 17 +
.../modules/test_radixtree/test_radixtree.c | 98 ++-
8 files changed, 558 insertions(+), 273 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 0874201d7e..cf294c01d6 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -7,6 +7,7 @@ create function bench_shuffle_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
@@ -23,6 +24,7 @@ create function bench_seq_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 7abb237e96..be3f7ed811 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -15,6 +15,7 @@
#include "lib/radixtree.h"
#include <math.h>
#include "miscadmin.h"
+#include "storage/lwlock.h"
#include "utils/timestamp.h"
PG_MODULE_MAGIC;
@@ -149,7 +150,9 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
BlockNumber minblk = PG_GETARG_INT32(0);
BlockNumber maxblk = PG_GETARG_INT32(1);
bool random_block = PG_GETARG_BOOL(2);
+ bool shared = PG_GETARG_BOOL(3);
radix_tree *rt = NULL;
+ dsa_area *dsa = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;
@@ -171,8 +174,11 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+ if (shared)
+ dsa = dsa_create(LWLockNewTrancheId());
+
/* measure the load time of the radix tree */
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, dsa);
start_time = GetCurrentTimestamp();
for (int i = 0; i < ntids; i++)
{
@@ -323,7 +329,7 @@ bench_load_random_int(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
pg_prng_seed(&state, 0);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
@@ -375,7 +381,7 @@ bench_fixed_height_search(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index b239b3c615..3b06f22af5 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -22,6 +22,15 @@
* choose it to avoid an additional pointer traversal. It is the reason this code
* currently does not support variable-length keys.
*
+ * If DSA space is specified when rt_create(), the radix tree is created in the
+ * DSA space so that multiple processes can access to it simultaneously. The process
+ * who created the shared radix tree need to tell both DSA area specified when
+ * calling to rt_create() and dsa_pointer of the radix tree, fetched by
+ * rt_get_dsa_pointer(), other processes so that they can attach by rt_attach().
+ *
+ * XXX: shared radix tree is still PoC state as it doesn't have any locking support.
+ * Also, it supports only single-process iteration.
+ *
* XXX: Most functions in this file have two variants for inner nodes and leaf
* nodes, therefore there are duplication codes. While this sometimes makes the
* code maintenance tricky, this reduces branch prediction misses when judging
@@ -59,12 +68,13 @@
#include "postgres.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
-#include "lib/radixtree.h"
-#include "lib/stringinfo.h"
/* The number of bits encoded in one tree level */
#define RT_NODE_SPAN BITS_PER_BYTE
@@ -152,6 +162,17 @@ typedef struct rt_node
#define NODE_HAS_FREE_SLOT(n) \
(((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+/*
+ * rt_node_ptr is used as a pointer for rt_node. It can be either a local address
+ * in non-shared radix tree case (RadixTreeIsShared() is true) or a dsa_pointer in
+ * shared radix tree case. The inner nodes of the radix tree need to use rt_node_ptr
+ * to store the child rt_node pointer instead of C-pointers. A rt_node_ptr can be
+ * converted to a local address of rt_node by using node_ptr_get_local().
+ */
+typedef uintptr_t rt_node_ptr;
+#define InvalidRTNodePointer ((rt_node_ptr) 0)
+#define RTNodePtrIsValid(x) ((x) != InvalidRTNodePointer)
+
/* Base type of each node kinds for leaf and inner nodes */
typedef struct rt_node_base_4
{
@@ -205,7 +226,7 @@ typedef struct rt_node_inner_4
rt_node_base_4 base;
/* 4 children, for key chunks */
- rt_node *children[4];
+ rt_node_ptr children[4];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
@@ -221,7 +242,7 @@ typedef struct rt_node_inner_32
rt_node_base_32 base;
/* 32 children, for key chunks */
- rt_node *children[32];
+ rt_node_ptr children[32];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
@@ -237,7 +258,7 @@ typedef struct rt_node_inner_128
rt_node_base_128 base;
/* Slots for 128 children */
- rt_node *children[128];
+ rt_node_ptr children[128];
} rt_node_inner_128;
typedef struct rt_node_leaf_128
@@ -260,7 +281,7 @@ typedef struct rt_node_inner_256
rt_node_base_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
+ rt_node_ptr children[RT_NODE_MAX_SLOTS];
} rt_node_inner_256;
typedef struct rt_node_leaf_256
@@ -344,6 +365,11 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than rt_node_ptr.
+ * We need either a safeguard to disallow other processes to begin the iteration
+ * while one process is doing or to allow multiple processes to do the iteration.
*/
typedef struct rt_node_iter
{
@@ -363,37 +389,56 @@ struct rt_iter
uint64 key;
};
-/* A radix tree with nodes */
-struct radix_tree
+/* Control information for an radix tree */
+typedef struct radix_tree_control
{
- MemoryContext context;
+ rt_node_ptr root;
- rt_node *root;
+ /* XXX: use pg_atomic_uint64 instead */
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
- MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
-
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_NODE_KIND_COUNT];
#endif
+} radix_tree_control;
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ radix_tree_control *ctl;
+
+ /* used only when the radix tree is shared */
+ dsa_area *dsa;
+ dsa_pointer ctl_dp;
+
+ /* used only when the radix tree is private */
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
};
+#define RadixTreeIsShared(rt) ((rt)->dsa != NULL)
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
- bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
+static rt_node_ptr rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+ bool inner);
+static rt_node_ptr rt_copy_node(radix_tree *tree, rt_node *node, int new_kind);
+static void rt_free_node(radix_tree *tree, rt_node_ptr nodep);
+static void rt_replace_node(radix_tree *tree, rt_node *parent, rt_node_ptr oldp,
+ rt_node_ptr newp, uint64 key);
static void rt_extend(radix_tree *tree, uint64 key);
static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
+ rt_node_ptr *childp_p);
static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, uint64 value);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node_ptr nodep,
+ rt_node *node, uint64 key, rt_node_ptr childp);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node_ptr nodep,
+ rt_node *node, uint64 key, uint64 value);
+static inline void rt_node_update_inner(rt_node *node, uint64 key, rt_node_ptr newchildp);
static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p);
@@ -403,6 +448,15 @@ static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
static void rt_verify_node(rt_node *node);
+/* Get the local address of nodep */
+static inline rt_node *
+node_ptr_get_local(radix_tree *tree, rt_node_ptr nodep)
+{
+ return RadixTreeIsShared(tree)
+ ? (rt_node *) dsa_get_address(tree->dsa, (dsa_pointer) nodep)
+ : (rt_node *) nodep;
+}
+
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
* if there is no such element.
@@ -550,10 +604,10 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_shift(uint8 *chunks, rt_node_ptr *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node_ptr) * (count - idx));
}
static inline void
@@ -565,7 +619,7 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_delete(uint8 *chunks, rt_node_ptr *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
@@ -580,15 +634,15 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children, int count)
+chunk_children_array_copy(uint8 *src_chunks, rt_node_ptr *src_children,
+ uint8 *dst_chunks, rt_node_ptr *dst_children, int count)
{
/* For better code generation */
if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
pg_unreachable();
memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
- memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node_ptr) * count);
}
static inline void
@@ -617,7 +671,7 @@ static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
- return (node->children[slot] != NULL);
+ return RTNodePtrIsValid(node->children[slot]);
}
static inline bool
@@ -627,7 +681,7 @@ node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
}
-static inline rt_node *
+static inline rt_node_ptr
node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
@@ -695,7 +749,7 @@ node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
}
static inline void
-node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node_ptr child)
{
int slotpos;
@@ -726,10 +780,10 @@ node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
/* Update the child corresponding to 'chunk' to 'child' */
static inline void
-node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node_ptr childp)
{
Assert(!NODE_IS_LEAF(node));
- node->children[node->base.slot_idxs[chunk]] = child;
+ node->children[node->base.slot_idxs[chunk]] = childp;
}
static inline void
@@ -746,7 +800,7 @@ static inline bool
node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ return RTNodePtrIsValid(node->children[chunk]);
}
static inline bool
@@ -756,7 +810,7 @@ node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
}
-static inline rt_node *
+static inline rt_node_ptr
node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
@@ -774,7 +828,7 @@ node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node_ptr child)
{
Assert(!NODE_IS_LEAF(node));
node->children[chunk] = child;
@@ -794,7 +848,7 @@ static inline void
node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ node->children[chunk] = InvalidRTNodePointer;
}
static inline void
@@ -835,28 +889,45 @@ static void
rt_new_root(radix_tree *tree, uint64 key)
{
int shift = key_get_shift(key);
- rt_node *node;
+ rt_node_ptr nodep;
- node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
- shift > 0);
- tree->max_val = shift_get_max_val(shift);
- tree->root = node;
+ nodep = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, shift > 0);
+ tree->ctl->max_val = shift_get_max_val(shift);
+ tree->ctl->root = nodep;
}
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
+static rt_node_ptr
rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
{
rt_node *newnode;
+ rt_node_ptr newnodep;
+
+ if (tree->dsa != NULL)
+ {
+ dsa_pointer dp;
+
+ if (inner)
+ dp = dsa_allocate0(tree->dsa, rt_node_kind_info[kind].inner_size);
+ else
+ dp = dsa_allocate0(tree->dsa, rt_node_kind_info[kind].leaf_size);
- if (inner)
- newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
- rt_node_kind_info[kind].inner_size);
+ newnodep = (rt_node_ptr) dp;
+ newnode = (rt_node *) dsa_get_address(tree->dsa, newnodep);
+ }
else
- newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
- rt_node_kind_info[kind].leaf_size);
+ {
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+ newnodep = (rt_node_ptr) newnode;
+ }
newnode->kind = kind;
newnode->shift = shift;
@@ -872,69 +943,81 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[kind]++;
+ tree->ctl->cnt[kind]++;
#endif
- return newnode;
+ return newnodep;
}
/*
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node *
+static rt_node_ptr
rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
{
rt_node *newnode;
+ rt_node_ptr newnodep;
- newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
- node->shift > 0);
+ newnodep = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode = node_ptr_get_local(tree, newnodep);
newnode->count = node->count;
- return newnode;
+ return newnodep;
}
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+rt_free_node(radix_tree *tree, rt_node_ptr nodep)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
- tree->root = NULL;
+ if (tree->ctl->root == nodep)
+ tree->ctl->root = InvalidRTNodePointer;
#ifdef RT_DEBUG
- /* update the statistics */
- tree->cnt[node->kind]--;
- Assert(tree->cnt[node->kind] >= 0);
+ {
+ rt_node *node = node_ptr_get_local(tree, nodep);
+
+ /* update the statistics */
+ tree->ctl->cnt[node->kind]--;
+ Assert(tree->ctl->cnt[node->kind] >= 0);
+ }
#endif
- pfree(node);
+ if (RadixTreeIsShared(tree))
+ dsa_free(tree->dsa, (dsa_pointer) nodep);
+ else
+ pfree((rt_node *) nodep);
}
/*
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node_ptr oldp,
+ rt_node_ptr newp, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ rt_node *old = node_ptr_get_local(tree, oldp);
- if (parent == old_child)
+#ifdef USE_ASSERT_CHECKING
{
- /* Replace the root node with the new large node */
- tree->root = new_child;
+ rt_node *new = node_ptr_get_local(tree, newp);
+
+ Assert(old->chunk == new->chunk);
+ Assert(old->shift == new->shift);
}
- else
- {
- bool replaced PG_USED_FOR_ASSERTS_ONLY;
+#endif
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
- Assert(replaced);
+ if (parent == old)
+ {
+ /* Replace the root node with the new large node */
+ tree->ctl->root = newp;
}
+ else
+ rt_node_update_inner(parent, key, newp);
- rt_free_node(tree, old_child);
+ rt_free_node(tree, oldp);
}
/*
@@ -945,7 +1028,8 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ rt_node *root = node_ptr_get_local(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
@@ -953,20 +1037,77 @@ rt_extend(radix_tree *tree, uint64 key)
while (shift <= target_shift)
{
rt_node_inner_4 *node;
+ rt_node_ptr nodep;
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
- shift, 0, true);
+ /* create the new root */
+ nodep = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, true);
+ node = (rt_node_inner_4 *) node_ptr_get_local(tree, nodep);
node->base.n.count = 1;
node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ node->children[0] = tree->ctl->root;
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ /* Update the root */
+ root->chunk = 0;
+ tree->ctl->root = nodep;
+ root = (rt_node *) node;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ tree->ctl->max_val = shift_get_max_val(target_shift);
+}
+
+/* XXX: can be merged to rt_node_search_inner with RT_ACTION_UPDATE? */
+static inline void
+rt_node_update_inner(rt_node *node, uint64 key, rt_node_ptr newchildp)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < -1)
+ break;
+
+ n4->children[idx] = newchildp;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < -1)
+ break;
+
+ n32->children[idx] = newchildp;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ node_inner_128_update(n128, chunk, newchildp);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ node_inner_256_set(n256, chunk, newchildp);
+ break;
+ }
+ }
}
/*
@@ -975,27 +1116,31 @@ rt_extend(radix_tree *tree, uint64 key)
*/
static inline void
rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+ rt_node_ptr nodep, rt_node *node)
{
int shift = node->shift;
+ Assert(node_ptr_get_local(tree, nodep) == node);
+
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ rt_node_ptr newchildp;
int newshift = shift - RT_NODE_SPAN;
- newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
- RT_GET_KEY_CHUNK(key, node->shift),
- newshift > 0);
- rt_node_insert_inner(tree, parent, node, key, newchild);
+ newchildp = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
+ RT_GET_KEY_CHUNK(key, node->shift),
+ newshift > 0);
+
+ rt_node_insert_inner(tree, parent, nodep, node, key, newchildp);
parent = node;
- node = newchild;
+ node = node_ptr_get_local(tree, newchildp);
+ nodep = newchildp;
shift -= RT_NODE_SPAN;
}
- rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
+ rt_node_insert_leaf(tree, parent, nodep, node, key, value);
+ tree->ctl->num_keys++;
}
/*
@@ -1006,11 +1151,11 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node_ptr *childp_p)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool found = false;
- rt_node *child = NULL;
+ rt_node_ptr childp = InvalidRTNodePointer;
switch (node->kind)
{
@@ -1025,7 +1170,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
found = true;
if (action == RT_ACTION_FIND)
- child = n4->children[idx];
+ childp = n4->children[idx];
else /* RT_ACTION_DELETE */
chunk_children_array_delete(n4->base.chunks, n4->children,
n4->base.n.count, idx);
@@ -1041,8 +1186,9 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
break;
found = true;
+
if (action == RT_ACTION_FIND)
- child = n32->children[idx];
+ childp = n32->children[idx];
else /* RT_ACTION_DELETE */
chunk_children_array_delete(n32->base.chunks, n32->children,
n32->base.n.count, idx);
@@ -1058,7 +1204,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
found = true;
if (action == RT_ACTION_FIND)
- child = node_inner_128_get_child(n128, chunk);
+ childp = node_inner_128_get_child(n128, chunk);
else /* RT_ACTION_DELETE */
node_inner_128_delete(n128, chunk);
@@ -1073,7 +1219,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
found = true;
if (action == RT_ACTION_FIND)
- child = node_inner_256_get_child(n256, chunk);
+ childp = node_inner_256_get_child(n256, chunk);
else /* RT_ACTION_DELETE */
node_inner_256_delete(n256, chunk);
@@ -1085,8 +1231,8 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
if (action == RT_ACTION_DELETE && found)
node->count--;
- if (found && child_p)
- *child_p = child;
+ if (found && childp_p)
+ *childp_p = childp;
return found;
}
@@ -1186,8 +1332,8 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node_ptr nodep, rt_node *node,
+ uint64 key, rt_node_ptr childp)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool chunk_exists = false;
@@ -1206,23 +1352,24 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n4->children[idx] = child;
+ n4->children[idx] = childp;
break;
}
if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
{
rt_node_inner_32 *new32;
+ rt_node_ptr new32p;
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);
+
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children,
n4->base.n.count);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
+ rt_replace_node(tree, parent, nodep, new32p, key);
node = (rt_node *) new32;
}
else
@@ -1236,7 +1383,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
+ n4->children[insertpos] = childp;
break;
}
}
@@ -1251,22 +1398,23 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n32->children[idx] = child;
+ n32->children[idx] = childp;
break;
}
if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
{
rt_node_inner_128 *new128;
+ rt_node_ptr new128p;
/* grow node from 32 to 128 */
- new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new128p = rt_copy_node(tree, (rt_node *) n32, RT_NODE_KIND_128);
+ new128 = (rt_node_inner_128 *) node_ptr_get_local(tree, new128p);
+
for (int i = 0; i < n32->base.n.count; i++)
node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
+ rt_replace_node(tree, parent, nodep, new128p, key);
node = (rt_node *) new128;
}
else
@@ -1279,7 +1427,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
+ n32->children[insertpos] = childp;
break;
}
}
@@ -1293,17 +1441,19 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- node_inner_128_update(n128, chunk, child);
+ node_inner_128_update(n128, chunk, childp);
break;
}
if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
{
rt_node_inner_256 *new256;
+ rt_node_ptr new256p;
/* grow node from 128 to 256 */
- new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ new256p = rt_copy_node(tree, (rt_node *) n128, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) node_ptr_get_local(tree, new256p);
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1313,13 +1463,12 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
+ rt_replace_node(tree, parent, nodep, new256p, key);
node = (rt_node *) new256;
}
else
{
- node_inner_128_insert(n128, chunk, child);
+ node_inner_128_insert(n128, chunk, childp);
break;
}
}
@@ -1331,7 +1480,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
- node_inner_256_set(n256, chunk, child);
+ node_inner_256_set(n256, chunk, childp);
break;
}
}
@@ -1351,7 +1500,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node_ptr nodep, rt_node *node,
uint64 key, uint64 value)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
@@ -1378,16 +1527,16 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
{
rt_node_leaf_32 *new32;
+ rt_node_ptr new32p;
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) node_ptr_get_local(tree, new32p);
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values,
n4->base.n.count);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
+ rt_replace_node(tree, parent, nodep, new32p, key);
node = (rt_node *) new32;
}
else
@@ -1423,15 +1572,16 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
{
rt_node_leaf_128 *new128;
+ rt_node_ptr new128p;
/* grow node from 32 to 128 */
- new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new128p = rt_copy_node(tree, (rt_node *) n32, RT_NODE_KIND_128);
+ new128 = (rt_node_leaf_128 *) node_ptr_get_local(tree, new128p);
+
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
+ rt_replace_node(tree, parent, nodep, new128p, key);
node = (rt_node *) new128;
}
else
@@ -1465,10 +1615,12 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
{
rt_node_leaf_256 *new256;
+ rt_node_ptr new256p;
/* grow node from 128 to 256 */
- new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ new256p = rt_copy_node(tree, (rt_node *) n128, RT_NODE_KIND_256);
+ new256 = (rt_node_leaf_256 *) node_ptr_get_local(tree, new256p);
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1478,8 +1630,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
+ rt_replace_node(tree, parent, nodep, new256p, key);
node = (rt_node *) new256;
}
else
@@ -1518,33 +1669,46 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
* Create the radix tree in the given memory context and return it.
*/
radix_tree *
-rt_create(MemoryContext ctx)
+rt_create(MemoryContext ctx, dsa_area *dsa)
{
radix_tree *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->context = ctx;
- tree->root = NULL;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+ if (dsa != NULL)
+ {
+ tree->dsa = dsa;
+ tree->ctl_dp = dsa_allocate0(dsa, sizeof(radix_tree_control));
+ tree->ctl = (radix_tree_control *) dsa_get_address(dsa, tree->ctl_dp);
+ }
+ else
+ {
+ tree->ctl_dp = InvalidDsaPointer;
+ tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
+ }
+
+ tree->ctl->root = InvalidRTNodePointer;
+ tree->ctl->max_val = 0;
+ tree->ctl->num_keys = 0;
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ if (dsa == NULL)
{
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].leaf_blocksize,
- rt_node_kind_info[i].leaf_size);
-#ifdef RT_DEBUG
- tree->cnt[i] = 0;
-#endif
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+ }
}
MemoryContextSwitchTo(old_ctx);
@@ -1552,16 +1716,48 @@ rt_create(MemoryContext ctx)
return tree;
}
+dsa_pointer
+rt_get_dsa_pointer(radix_tree *tree)
+{
+ return tree->ctl_dp;
+}
+
+radix_tree *
+rt_attach(dsa_area *dsa, dsa_pointer dp)
+{
+ radix_tree *tree;
+
+ /* XXX: memory context support */
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
+
+ tree->ctl_dp = dp;
+ tree->ctl = (radix_tree_control *) dsa_get_address(dsa, dp);
+
+ /* XXX: do we need to set a callback on exit to detach dsa? */
+
+ return tree;
+}
+
/*
* Free the given radix tree.
*/
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ if (RadixTreeIsShared(tree))
+ {
+ dsa_free(tree->dsa, tree->ctl_dp);
+ dsa_detach(tree->dsa);
+ }
+ else
{
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
}
pfree(tree);
@@ -1576,48 +1772,48 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
+ rt_node *parent;
rt_node *node;
- rt_node *parent = tree->root;
+ rt_node_ptr nodep;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RTNodePtrIsValid(tree->ctl->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
rt_extend(tree, key);
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = tree->root;
+ parent = node_ptr_get_local(tree, tree->ctl->root);
+ nodep = tree->ctl->root;
+ shift = parent->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_node_ptr childp;
+
+ node = node_ptr_get_local(tree, nodep);
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &childp))
{
- rt_set_extend(tree, key, value, parent, node);
+ rt_set_extend(tree, key, value, parent, nodep, node);
return false;
}
- Assert(child);
-
parent = node;
- node = child;
+ nodep = childp;
shift -= RT_NODE_SPAN;
}
- updated = rt_node_insert_leaf(tree, parent, node, key, value);
+ updated = rt_node_insert_leaf(tree, parent, nodep, node, key, value);
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ tree->ctl->num_keys++;
return updated;
}
@@ -1635,24 +1831,24 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RTNodePtrIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = node_ptr_get_local(tree, tree->ctl->root);
+ shift = node->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_node_ptr childp;
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &childp))
return false;
- node = child;
+ node = node_ptr_get_local(tree, childp);
shift -= RT_NODE_SPAN;
}
@@ -1667,37 +1863,40 @@ bool
rt_delete(radix_tree *tree, uint64 key)
{
rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ rt_node_ptr nodep;
+ rt_node_ptr stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ if (!RTNodePtrIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ nodep = tree->ctl->root;
+ node = node_ptr_get_local(tree, nodep);
+ shift = node->shift;
level = -1;
while (shift > 0)
{
- rt_node *child;
+ rt_node_ptr childp;
/* Push the current node to the stack */
- stack[++level] = node;
+ stack[++level] = nodep;
+ node = node_ptr_get_local(tree, nodep);
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &childp))
return false;
- node = child;
+ nodep = childp;
shift -= RT_NODE_SPAN;
}
/* Delete the key from the leaf node if exists */
- Assert(NODE_IS_LEAF(node));
+ node = node_ptr_get_local(tree, nodep);
deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
if (!deleted)
@@ -1707,7 +1906,7 @@ rt_delete(radix_tree *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ tree->ctl->num_keys--;
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1717,12 +1916,13 @@ rt_delete(radix_tree *tree, uint64 key)
return true;
/* Free the empty leaf node */
- rt_free_node(tree, node);
+ rt_free_node(tree, nodep);
/* Delete the key in inner nodes recursively */
while (level >= 0)
{
- node = stack[level--];
+ nodep = stack[level--];
+ node = node_ptr_get_local(tree, nodep);
deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
Assert(deleted);
@@ -1732,7 +1932,7 @@ rt_delete(radix_tree *tree, uint64 key)
break;
/* The node became empty */
- rt_free_node(tree, node);
+ rt_free_node(tree, nodep);
}
/*
@@ -1741,8 +1941,8 @@ rt_delete(radix_tree *tree, uint64 key)
*/
if (level == 0)
{
- tree->root = NULL;
- tree->max_val = 0;
+ tree->ctl->root = InvalidRTNodePointer;
+ tree->ctl->max_val = 0;
}
return true;
@@ -1753,6 +1953,7 @@ rt_iter *
rt_begin_iterate(radix_tree *tree)
{
MemoryContext old_ctx;
+ rt_node *root;
rt_iter *iter;
int top_level;
@@ -1765,14 +1966,15 @@ rt_begin_iterate(radix_tree *tree)
if (!iter->tree)
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = node_ptr_get_local(tree, tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ rt_update_iter_stack(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1792,7 +1994,6 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
{
rt_node_iter *node_iter = &(iter->stack[level--]);
- /* Set the node to this level */
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1828,7 +2029,6 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
/* Advance the leaf node iterator to get next key-value pair */
found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
-
if (found)
{
*key_p = iter->key;
@@ -1898,7 +2098,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
if (node_iter->current_idx >= n4->base.n.count)
break;
- child = n4->children[node_iter->current_idx];
+ child = node_ptr_get_local(iter->tree, n4->children[node_iter->current_idx]);
key_chunk = n4->base.chunks[node_iter->current_idx];
found = true;
break;
@@ -1911,7 +2111,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
if (node_iter->current_idx >= n32->base.n.count)
break;
- child = n32->children[node_iter->current_idx];
+ child = node_ptr_get_local(iter->tree, n32->children[node_iter->current_idx]);
key_chunk = n32->base.chunks[node_iter->current_idx];
found = true;
break;
@@ -1931,7 +2131,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
break;
node_iter->current_idx = i;
- child = node_inner_128_get_child(n128, i);
+ child = node_ptr_get_local(iter->tree, node_inner_128_get_child(n128, i));
key_chunk = i;
found = true;
break;
@@ -1951,7 +2151,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
break;
node_iter->current_idx = i;
- child = node_inner_256_get_child(n256, i);
+ child = node_ptr_get_local(iter->tree, node_inner_256_get_child(n256, i));
key_chunk = i;
found = true;
break;
@@ -2062,7 +2262,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64
rt_num_entries(radix_tree *tree)
{
- return tree->num_keys;
+ return tree->ctl->num_keys;
}
/*
@@ -2071,12 +2271,17 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ if (RadixTreeIsShared(tree))
+ total = dsa_get_total_size(tree->dsa);
+ else
{
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
}
return total;
@@ -2161,17 +2366,18 @@ void
rt_stats(radix_tree *tree)
{
ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[0],
- tree->cnt[1],
- tree->cnt[2],
- tree->cnt[3])));
+ tree->ctl->num_keys,
+ node_ptr_get_local(tree, tree->ctl->root)->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[0],
+ tree->ctl->cnt[1],
+ tree->ctl->cnt[2],
+ tree->ctl->cnt[3])));
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(radix_tree *tree, rt_node_ptr nodep, int level, bool recurse)
{
+ rt_node *node = node_ptr_get_local(tree, nodep);
char space[128] = {0};
fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
@@ -2205,7 +2411,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ rt_dump_node(tree, n4->children[i], level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2232,7 +2438,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ rt_dump_node(tree, n32->children[i], level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2284,7 +2490,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_128_get_child(n128, i),
+ rt_dump_node(tree, node_inner_128_get_child(n128, i),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2317,8 +2523,8 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
+ rt_dump_node(tree, node_inner_256_get_child(n256, i),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2328,6 +2534,28 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
}
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = %lu\n", tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree, tree->ctl->root, 0, true);
+}
+
+#ifdef unused
void
rt_dump_search(radix_tree *tree, uint64 key)
{
@@ -2336,23 +2564,23 @@ rt_dump_search(radix_tree *tree, uint64 key)
int level = 0;
elog(NOTICE, "-----------------------------------------------------------");
- elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->ctl->max_val, tree->ctl->max_val);
- if (!tree->root)
+ if (!tree->ctl->root)
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
{
elog(NOTICE, "key %lu (0x%lX) is larger than max val",
key, key);
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
while (shift >= 0)
{
rt_node *child;
@@ -2377,25 +2605,6 @@ rt_dump_search(radix_tree *tree, uint64 key)
level++;
}
}
+#endif
-void
-rt_dump(radix_tree *tree)
-{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
- fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_size,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].leaf_size,
- rt_node_kind_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = %lu\n", tree->max_val);
-
- if (!tree->root)
- {
- fprintf(stderr, "empty tree\n");
- return;
- }
-
- rt_dump_node(tree->root, 0, true);
-}
#endif
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 82376fde2d..ad169882af 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
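+/*
+ * Return the total size of the segments currently backing the given area.
+ */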
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..d9d8355c21 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -14,18 +14,22 @@
#define RADIXTREE_H
#include "postgres.h"
+#include "utils/dsa.h"
#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
-extern radix_tree *rt_create(MemoryContext ctx);
+extern radix_tree *rt_create(MemoryContext ctx, dsa_area *dsa);
extern void rt_free(radix_tree *tree);
extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern dsa_pointer rt_get_dsa_pointer(radix_tree *tree);
+extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
extern bool rt_delete(radix_tree *tree, uint64 key);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 405606fe2f..dad06adecc 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index cc6970c87c..a0ff1e1c77 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -5,21 +5,38 @@ CREATE EXTENSION test_radixtree;
--
SELECT test_radixtree();
NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "16"
NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
NOTICE: testing radix tree with pattern "all ones"
NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
NOTICE: testing radix tree with pattern "single values, distance > 2^32"
NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
test_radixtree
----------------
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index cb3596755d..a08495834e 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -19,6 +19,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -111,7 +112,7 @@ test_empty(void)
radix_tree *radixtree;
uint64 dummy;
- radixtree = rt_create(CurrentMemoryContext);
+ radixtree = rt_create(CurrentMemoryContext, NULL);
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -217,14 +218,10 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
* level.
*/
static void
-test_node_types(uint8 shift)
+do_test_node_types(radix_tree *radixtree, uint8 shift)
{
- radix_tree *radixtree;
-
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
- radixtree = rt_create(CurrentMemoryContext);
-
/*
* Insert and search entries for every node type at the 'shift' level,
* then delete all entries to make it empty, and insert and search entries
@@ -233,19 +230,38 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift);
test_node_types_delete(radixtree, shift);
test_node_types_insert(radixtree, shift);
+}
- rt_free(radixtree);
+static void
+test_node_types(void)
+{
+ int tranche_id = LWLockNewTrancheId();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ {
+ radix_tree *tree;
+ dsa_area *dsa;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ dsa = dsa_create(tranche_id);
+ tree = rt_create(CurrentMemoryContext, dsa);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+ }
}
/*
* Test with a repeating pattern, defined by the 'spec'.
*/
static void
-test_pattern(const test_spec * spec)
+do_test_pattern(radix_tree *radixtree, const test_spec * spec)
{
- radix_tree *radixtree;
rt_iter *iter;
- MemoryContext radixtree_ctx;
TimestampTz starttime;
TimestampTz endtime;
uint64 n;
@@ -271,18 +287,6 @@ test_pattern(const test_spec * spec)
pattern_values[pattern_num_values++] = i;
}
- /*
- * Allocate the radix tree.
- *
- * Allocate it in a separate memory context, so that we can print its
- * memory usage easily.
- */
- radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
- "radixtree test",
- ALLOCSET_SMALL_SIZES);
- MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
- radixtree = rt_create(radixtree_ctx);
-
/*
* Add values to the set.
*/
@@ -336,8 +340,6 @@ test_pattern(const test_spec * spec)
mem_usage = rt_memory_usage(radixtree);
fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
mem_usage, (double) mem_usage / spec->num_values);
-
- MemoryContextStats(radixtree_ctx);
}
/* Check that rt_num_entries works */
@@ -484,21 +486,53 @@ test_pattern(const test_spec * spec)
if ((nbefore - ndeleted) != nafter)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+}
+
+static void
+test_patterns(void)
+{
+ int tranche_id = LWLockNewTrancheId();
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ {
+ radix_tree *tree;
+ MemoryContext radixtree_ctx;
+ dsa_area *dsa;
+ const test_spec *spec = &test_specs[i];
- MemoryContextDelete(radixtree_ctx);
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+ /* Test the local radix tree */
+ tree = rt_create(radixtree_ctx, NULL);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextReset(radixtree_ctx);
+
+ /* Test the shared radix tree */
+ dsa = dsa_create(tranche_id);
+ tree = rt_create(radixtree_ctx, dsa);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextDelete(radixtree_ctx);
+ }
}
Datum
test_radixtree(PG_FUNCTION_ARGS)
{
test_empty();
-
- for (int shift = 0; shift <= (64 - 8); shift += 8)
- test_node_types(shift);
-
- /* Test different test patterns, with lots of entries */
- for (int i = 0; i < lengthof(test_specs); i++)
- test_pattern(&test_specs[i]);
+ test_node_types();
+ test_patterns();
PG_RETURN_VOID();
}
--
2.31.1
v8-0002-Add-radix-implementation.patchapplication/octet-stream; name=v8-0002-Add-radix-implementation.patchDownload
From 45a5a064b71dc6f58d333984a7a571cc3cd80e63 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v8 2/4] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2401 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 28 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 504 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3066 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..b239b3c615
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2401 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression and lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree to become
+ * very tall.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes, with
+ * shift > 0, store pointers to their child nodes as values, whereas leaf nodes,
+ * with shift == 0, store the 64-bit unsigned integers specified by the user as
+ * values. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. This is also the reason
+ * this code currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is some code duplication. While this sometimes makes code
+ * maintenance tricky, it reduces branch prediction misses when judging
+ * whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iterate - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for each kind of radix tree node.
+ *
+ * rt_iterate_next() is guaranteed to return key-value pairs in the ascending
+ * order of the key.
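+ *
+ * A minimal usage sketch (illustrative only, using the functions declared in
+ * lib/radixtree.h):
+ *
+ *   radix_tree *tree = rt_create(CurrentMemoryContext);
+ *   uint64 val;
+ *
+ *   rt_set(tree, 42, 9000);           -- map key 42 to value 9000
+ *   if (rt_search(tree, 42, &val))    -- val is set to 9000
+ *       ...
+ *   rt_free(tree);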
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the size in bytes of the is-set bitmap covering nslots slots, used
+ * by nodes whose slots are indexed by array lookup.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Maximum number of levels the radix tree can have */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-128 */
+#define RT_NODE_128_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
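+
+/*
+ * For example, with RT_NODE_SPAN = 8 the key 0x0102030405060708 decomposes
+ * into the chunks 0x01 (shift 56) through 0x08 (shift 0);
+ * RT_GET_KEY_CHUNK(0x0102030405060708, 8) yields 0x07.
+ */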
+
+/*
+ * Mapping from a slot number to the corresponding byte and bit in the is-set bitmap.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search_inner() and rt_node_search_leaf() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation
+ * a smaller class should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 40/40 -> 296/286 -> 1288/1304 -> 2056/2088 bytes for inner nodes and
+ * leaf nodes, respectively, leading to a large amount of allocator padding
+ * with aset.c. Hence the use of slab.
+ *
+ * XXX: do we need a node-1 kind as long as there is no path compression optimization?
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_128 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/* Common type for all node types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size kind of the node */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+
+/* Base types of each node kind, shared by leaf and inner nodes */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* key chunks for up to 4 entries */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* key chunks for up to 32 entries */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-128 uses a slot_idxs array, an array of length RT_NODE_MAX_SLOTS, typically
+ * 256, to store indexes into a second array that contains up to 128 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base128
+{
+ rt_node n;
+
+ /* The slot index for each chunk; RT_NODE_128_INVALID_IDX if unused */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_128;
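+
+/*
+ * For illustration: after inserting the chunk 0x2A into a node-128,
+ * slot_idxs[0x2A] holds the slot position assigned to that chunk, and the
+ * value (or child pointer) lives at that position in the values (or children)
+ * array. Chunks whose slot_idxs entry is RT_NODE_128_INVALID_IDX are unused.
+ */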
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate from inner node size classes for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* 4 children, for key chunks */
+ rt_node *children[4];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* 4 values, for key chunks */
+ uint64 values[4];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* 32 children, for key chunks */
+ rt_node *children[32];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* 32 values, for key chunks */
+ uint64 values[32];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 children */
+ rt_node *children[128];
+} rt_node_inner_128;
+
+typedef struct rt_node_leaf_128
+{
+ rt_node_base_128 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* Slots for 128 values */
+ uint64 values[128];
+} rt_node_leaf_128;
+
+/*
+ * node-256 is the largest node type. This node has an array of length
+ * RT_NODE_MAX_SLOTS for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size kind */
+typedef struct rt_node_kind_info_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_node_kind_info_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
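+
+/*
+ * For example, assuming the default 8kB slab block size: a 40-byte node-4
+ * gets Max(8160, 1280) = 8160-byte blocks, while a 2088-byte leaf node-256
+ * gets Max(6264, 66816) = 66816-byte blocks, so at least 32 nodes always fit
+ * into one block.
+ */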
+static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
+
+ [RT_NODE_KIND_4] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4),
+ .leaf_size = sizeof(rt_node_leaf_4),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
+ },
+ [RT_NODE_KIND_32] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32),
+ .leaf_size = sizeof(rt_node_leaf_32),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
+ },
+ [RT_NODE_KIND_128] = {
+ .name = "radix tree node 128",
+ .fanout = 128,
+ .inner_size = sizeof(rt_node_inner_128),
+ .leaf_size = sizeof(rt_node_leaf_128),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
+ },
+ [RT_NODE_KIND_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
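+
+/*
+ * A typical iteration loop looks like this (illustrative sketch):
+ *
+ *   rt_iter *iter = rt_begin_iterate(tree);
+ *   uint64 key, value;
+ *
+ *   while (rt_iterate_next(iter, &key, &value))
+ *       ... pairs are returned in ascending key order ...
+ *   rt_end_iterate(iter);
+ */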
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_NODE_KIND_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64 *) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_values, src_values, sizeof(uint64) * count);
+}
+
+/* Functions to manipulate inner and leaf node-128 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_128 *) node)->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+static void
+node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Return an unused slot in node-128 */
+static int
+node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Delete the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
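+
+/*
+ * For example, key_get_shift(0x10000) returns 16 (the leftmost set bit is bit
+ * 16, leaving two 8-bit levels below the root), and shift_get_max_val(16)
+ * returns 2^24 - 1, the largest key such a tree can store.
+ */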
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ rt_node *node;
+
+ node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
+ shift > 0);
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = node;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+ newnode->kind = kind;
+ newnode->shift = shift;
+ newnode->chunk = chunk;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+
+ memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+ }
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[kind]++;
+#endif
+
+ return newnode;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count as 'node'.
+ */
+static rt_node *
+rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode->count = node->count;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ tree->root = NULL;
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[node->kind]--;
+ Assert(tree->cnt[node->kind] >= 0);
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
+ shift, 0, true);
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have inner and leaf nodes for the given key.
+ * Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+
+ newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
+ RT_GET_KEY_CHUNK(key, node->shift),
+ newshift > 0);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is returned in *child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_128_get_child(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is returned in *value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_128_get_value(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children,
+ n4->base.n.count);
+
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_inner_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_128_update(n128, chunk, child);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_inner_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_128_get_child(n128, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and child are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values,
+ n4->base.n.count);
+
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_leaf_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_128_update(n128, chunk, value);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_leaf_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_128_get_value(n128, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_128_insert(n128, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to 'value'
+ * and return true. Returns false if entry doesn't yet exist.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ Assert(child);
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set to *value_p, which
+ * therefore must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ /*
+ * If we eventually deleted the root node while recursively deleting empty
+ * nodes, we make the tree empty.
+ */
+ if (level == 0)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ /* Set the node to this level */
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from level 1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_128_get_child(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value to *value_p, otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_128_get_value(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) node,
+ n128->slot_idxs[i]));
+ else
+ Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) node,
+ n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check if the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0],
+ tree->cnt[1],
+ tree->cnt[2],
+ tree->cnt[3])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[128] = {0};
+
+ fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *b128 = (rt_node_base_128 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b128->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_128_get_value(n128, i));
+ }
+ else
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_128_get_child(n128, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached at a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = %lu\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 7b3f292965..e587cabe13 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -26,6 +26,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index c2e5f5ffd5..c86f6bdcb0 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -20,6 +20,7 @@ subdir('test_oat_hooks')
subdir('test_parser')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..cb3596755d
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,504 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int rt_node_max_entries[] = {
+ 4, /* RT_NODE_KIND_4 */
+ 16, /* RT_NODE_KIND_16 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ uint64 dummy;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree returned non-zero");
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(rt_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (i == (rt_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : rt_node_max_entries[j - 1],
+ rt_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test inserting and deleting key-value pairs into each node type at the
+ * given shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
Attachment: v8-0003-tool-for-measuring-radix-tree-performance.patch
From 799d4d6500bec90171c0d9ee81f55af480583323 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v8 3/4] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 56 +++
contrib/bench_radix_tree/bench_radix_tree.c | 466 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 578 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..0874201d7e
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,56 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..7abb237e96
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,466 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I've attached v8 patches. 0001, 0002, and 0003 patches incorporated
the comments I got so far. 0004 patch is a DSA support patch for PoC.
Thanks for the new patchset. This is not a full review, but I have some
comments:
0001 and 0002 look okay on a quick scan -- I will use this as a base for
further work that we discussed. However, before I do so I'd like to request
another revision regarding the following:
In 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes
to point its children, and we use rt_node_ptr as either rt_node* or
dsa_pointer depending on whether the radix tree is shared or not (ie,
by checking radix_tree->dsa == NULL).
0004: Looks like a good start, but this patch has a large number of changes
like these, making it hard to read:
- if (found && child_p)
- *child_p = child;
+ if (found && childp_p)
+ *childp_p = childp;
...
rt_node_inner_32 *new32;
+ rt_node_ptr new32p;
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);
It's difficult to keep in my head what all the variables refer to. I
thought a bit about how to split this patch up to make this easier to read.
Here's what I came up with:
typedef struct rt_node_ptr
{
uintptr_t encoded;
rt_node * decoded;
}
Note that there is nothing about "dsa or local". That's deliberate. That
way, we can use the "encoded" field for a tagged pointer as well, as I hope
we can do (at least for local pointers) in the future. So an intermediate
patch would have "static inline void" functions node_ptr_encode() and
node_ptr_decode(), which would only copy from one member to another. I
suspect that: 1. The actual DSA changes will be *much* smaller and easier
to reason about. 2. Experimenting with tagged pointers will be easier.
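For illustration, a minimal sketch of those two helpers could look like the following (the bodies are only the member-copying step described above, not the eventual DSA-aware or tagged version):

static inline void
node_ptr_encode(rt_node_ptr *ptr)
{
	/* for local memory the encoded form is just the raw address */
	ptr->encoded = (uintptr_t) ptr->decoded;
}

static inline void
node_ptr_decode(rt_node_ptr *ptr)
{
	/* a later patch could translate a dsa_pointer or strip tag bits here */
	ptr->decoded = (rt_node *) ptr->encoded;
}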
Also, quick question: 0004 has a new function rt_node_update_inner() -- is
that necessary because of DSA?, or does this ideally belong in 0002? What's
the reason for it?
Regarding the performance, I've
added another boolean argument to bench_seq/shuffle_search(),
specifying whether to use the shared radix tree or not. Here are
benchmark results in my environment,
[...]
In non-shared radix tree cases (the fourth argument is false), I don't
see a visible performance degradation. On the other hand, in shared
radix tree cases (the fourth argument is true), I see visible overhead
because of dsa_get_address().
Thanks, this is useful.
Please note that the current shared radix tree implementation doesn't
support any locking, so it cannot be read while written by someone.
I think at the very least we need a global lock to enforce this.
Also, only one process can iterate over the shared radix tree. When it
comes to parallel vacuum, these don't become restriction as the leader
process writes the radix tree while scanning heap and the radix tree
is read by multiple processes while vacuuming indexes. And only the
leader process can do heap vacuum by iterating the key-value pairs in
the radix tree. If we want to use it for other cases too, we would
need to support locking, RCU or something.
A useful exercise here is to think about what we'd need to do parallel heap
pruning. We don't need to go that far for v16 of course, but what's the
simplest thing we can do to make that possible? Other use cases can change
to more sophisticated schemes if need be.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Nov 3, 2022 at 1:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached v8 patches. 0001, 0002, and 0003 patches incorporated
the comments I got so far. 0004 patch is a DSA support patch for PoC.
Thanks for the new patchset. This is not a full review, but I have some comments:
0001 and 0002 look okay on a quick scan -- I will use this as a base for further work that we discussed. However, before I do so I'd like to request another revision regarding the following:
In 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes
to point its children, and we use rt_node_ptr as either rt_node* or
dsa_pointer depending on whether the radix tree is shared or not (ie,
by checking radix_tree->dsa == NULL).
Thank you for the comments!
0004: Looks like a good start, but this patch has a large number of changes like these, making it hard to read:
- if (found && child_p)
- *child_p = child;
+ if (found && childp_p)
+ *childp_p = childp;
...
rt_node_inner_32 *new32;
+ rt_node_ptr new32p;
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);
It's difficult to keep in my head what all the variables refer to. I
thought a bit about how to split this patch up to make this easier to read.
Here's what I came up with:
typedef struct rt_node_ptr
{
uintptr_t encoded;
rt_node * decoded;
}
Note that there is nothing about "dsa or local". That's deliberate. That way, we can use the "encoded" field for a tagged pointer as well, as I hope we can do (at least for local pointers) in the future. So an intermediate patch would have "static inline void" functions node_ptr_encode() and node_ptr_decode(), which would only copy from one member to another. I suspect that: 1. The actual DSA changes will be *much* smaller and easier to reason about. 2. Experimenting with tagged pointers will be easier.
Good idea. Will try in the next version patch.
Also, quick question: 0004 has a new function rt_node_update_inner() -- is that necessary because of DSA?, or does this ideally belong in 0002? What's the reason for it?
Oh, this was needed at one point when I was initially writing the DSA
support, but thinking about it again now, I think we can remove it and
use rt_node_insert_inner() with parent = NULL instead.
Regarding the performance, I've
added another boolean argument to bench_seq/shuffle_search(),
specifying whether to use the shared radix tree or not. Here are
benchmark results in my environment, [...]
In non-shared radix tree cases (the fourth argument is false), I don't
see a visible performance degradation. On the other hand, in shared
radix tree cases (the fourth argument is true), I see visible overhead
because of dsa_get_address().
Thanks, this is useful.
Please note that the current shared radix tree implementation doesn't
support any locking, so it cannot be read while written by someone.
I think at the very least we need a global lock to enforce this.
Also, only one process can iterate over the shared radix tree. When it
comes to parallel vacuum, these don't become restriction as the leader
process writes the radix tree while scanning heap and the radix tree
is read by multiple processes while vacuuming indexes. And only the
leader process can do heap vacuum by iterating the key-value pairs in
the radix tree. If we want to use it for other cases too, we would
need to support locking, RCU or something.
A useful exercise here is to think about what we'd need to do parallel heap pruning. We don't need to go that far for v16 of course, but what's the simplest thing we can do to make that possible? Other use cases can change to more sophisticated schemes if need be.
For parallel heap pruning, multiple workers will insert key-value
pairs to the radix tree concurrently. The simplest solution would be a
single lock to protect writes but the performance will not be good.
Another solution would be to divide the table into multiple
ranges so that keys derived from TIDs do not conflict with each
other, and have parallel workers process one or more ranges. That way,
parallel vacuum workers can build *sub-trees* and the leader process
can merge them. In use cases of lazy vacuum, since the write phase and
read phase are separated the readers don't need to worry about
concurrent updates.
I've attached a draft patch for lazy vacuum integration that can be
applied on top of v8 patches. The patch adds a new module called
TIDStore, an efficient storage for TID backed by radix tree. Lazy
vacuum and parallel vacuum use it instead of a TID array. The patch
also introduces rt_detach() that was missed in 0002 patch. It's a very
rough patch but I hope it helps in considering lazy vacuum
integration, radix tree APIs, and shared radix tree functionality.
There are some TODOs:
* We need to reset the TIDStore and therefore reset the radix tree. It
can easily be done by using MemoryContextReset() in non-shared radix
tree cases, but in shared case, we need either to free all radix tree
nodes recursively or introduce a way to release all allocated DSA
memory.
* We need to limit the size of TIDStore (mainly radix_tree) in
maintenance_work_mem.
* We need to change the counter-based information in
pg_stat_progress_vacuum such as max_dead_tuples and num_dead_tuples.
I think it would be better to show the maximum number of bytes available
for collecting TIDs and the current usage instead.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v8-0005-PoC-lazy-vacuum-integration.patch (application/octet-stream)
From 315483e86611f485136efc6a6f141dd0caf3691c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 4 Nov 2022 14:14:42 +0900
Subject: [PATCH v8 5/5] PoC: lazy vacuum integration.
The patch includes:
* Introducing a new module called TIDStore
* Lazy vacuum and parallel vacuum integration.
TODOs:
* radix tree needs to have the reset functionality.
* should not allow TIDStore to grow beyond the memory limit.
* change the progress statistics of pg_stat_progress_vacuum.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 273 ++++++++++++++++++++++++++
src/backend/access/heap/vacuumlazy.c | 160 +++++----------
src/backend/commands/vacuum.c | 45 +----
src/backend/commands/vacuumparallel.c | 59 +++---
src/backend/lib/radixtree.c | 9 +
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 55 ++++++
src/include/commands/vacuum.h | 24 +--
src/include/lib/radixtree.h | 1 +
src/include/storage/lwlock.h | 1 +
12 files changed, 436 insertions(+), 195 deletions(-)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index 857beaa32d..76265974b1 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..8793c87fab
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,273 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * TID (ItemPointer) storage implementation.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* XXX: should be configurable for non-heap AMs */
+#define TIDSTORE_OFFSET_NBITS 11 /* pg_ceil_log2_32(MaxHeapTuplesPerPage) */
+
+#define TIDSTORE_VALUE_NBITS 6 /* log(sizeof(uint64) * BITS_PER_BYTE, 2) */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+struct TIDStore
+{
+ /* main storage for TID */
+ radix_tree *tree;
+
+ /* # of tids in TIDStore */
+ int num_tids;
+
+ /* DSA area and handle for shared TIDStore */
+ dsa_pointer handle;
+ dsa_area *dsa;
+};
+
+static void tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+TIDStore *
+tidstore_create(dsa_area *dsa)
+{
+ TIDStore *ts;
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_create(CurrentMemoryContext, dsa);
+ ts->dsa = dsa;
+
+ if (dsa != NULL)
+ ts->handle = rt_get_dsa_pointer(ts->tree);
+
+ return ts;
+}
+
+/* Attach the shared TIDStore */
+TIDStore *
+tidstore_attach(dsa_area *dsa, dsa_pointer handle)
+{
+ TIDStore *ts;
+
+ Assert(dsa != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_attach(dsa, handle);
+
+ return ts;
+}
+
+void
+tidstore_detach(TIDStore *ts)
+{
+ rt_detach(ts->tree);
+}
+
+void
+tidstore_free(TIDStore *ts)
+{
+ rt_free(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_reset(TIDStore *ts)
+{
+ if (ts->dsa != NULL)
+ {
+ /* XXX: reset shared radix tree */
+ Assert(false);
+ }
+ else
+ {
+ ts->num_tids = 0;
+
+ rt_free(ts->tree);
+ ts->tree = rt_create(CurrentMemoryContext, NULL);
+ }
+}
+
+/* Add TIDs to TIDStore */
+void
+tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ ts->num_tids++;
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+}
+
+/* Return true if the given TID is present in TIDStore */
+bool
+tidstore_lookup_tid(TIDStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = rt_search(ts->tree, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+TIDStoreIter *
+tidstore_begin_iterate(TIDStore *ts)
+{
+ TIDStoreIter *iter;
+
+ iter = palloc0(sizeof(TIDStoreIter));
+ iter->ts = ts;
+ iter->tree_iter = rt_begin_iterate(ts->tree);
+ iter->blkno = InvalidBlockNumber;
+
+ return iter;
+}
+
+bool
+tidstore_iterate_next(TIDStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+
+ if (iter->finished)
+ return false;
+
+ if (BlockNumberIsValid(iter->blkno))
+ {
+ iter->num_offsets = 0;
+ tidstore_iter_collect_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (rt_iterate_next(iter->tree_iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(iter->blkno) && iter->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return true;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_collect_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return true;
+}
+
+uint64
+tidstore_num_tids(TIDStore *ts)
+{
+ return ts->num_tids;
+}
+
+uint64
+tidstore_memory_usage(TIDStore *ts)
+{
+ return (uint64) sizeof(TIDStore) + rt_memory_usage(ts->tree);
+}
+
+tidstore_handle
+tidstore_get_handle(TIDStore *ts)
+{
+ return rt_get_dsa_pointer(ts->tree);
+}
+
+/* Extract TIDs from key-value pair */
+static void
+tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val)
+{
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ iter->offsets[iter->num_offsets++] = off;
+ }
+
+ iter->blkno = KEY_GET_BLKNO(key);
+}
+
+/* Encode a TID to key and val */
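+/*
+ * Worked example, using the macros above: with TIDSTORE_OFFSET_NBITS = 11
+ * and TIDSTORE_VALUE_NBITS = 6, the TID (block 10, offset 3) gives
+ * tid_i = 3 | (10 << 11) = 20483, which is split into the radix tree key
+ * 20483 >> 6 = 320 and the bit position *off = 20483 & 63 = 3 within the
+ * 64-bit value. KEY_GET_BLKNO(320) recovers the block: 320 >> (11 - 6) = 10.
+ */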
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index dfbe37472f..5b013bc3a8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -144,6 +145,8 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+ int max_bytes;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -194,7 +197,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TIDStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -265,8 +268,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -397,6 +401,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indname = NULL;
vacrel->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
vacrel->verbose = verbose;
+ vacrel->max_bytes = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
errcallback.callback = vacuum_error_callback;
errcallback.arg = vacrel;
errcallback.previous = error_context_stack;
@@ -858,7 +865,7 @@ lazy_scan_heap(LVRelState *vacrel)
next_unskippable_block,
next_failsafe_block = 0,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TIDStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
@@ -872,7 +879,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = vacrel->max_bytes; /* XXX: should use # of tids */
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -942,8 +949,8 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ /* XXX: should not allow tidstore to grow beyond max_bytes */
+ if (tidstore_memory_usage(vacrel->dead_items) > vacrel->max_bytes)
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1075,11 +1082,17 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TIDStoreIter *iter;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ pfree(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1116,7 +1129,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1269,7 +1282,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1903,25 +1916,16 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
Assert(!prunestate->all_visible);
Assert(prunestate->has_lpdead_items);
vacrel->lpdead_item_pages++;
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
}
/* Finally, add page-local counts to whole-VACUUM counts */
@@ -2128,8 +2132,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2138,17 +2141,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2197,7 +2193,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2226,7 +2222,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2253,8 +2249,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2299,7 +2295,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ /* tidstore_reset(vacrel->dead_items); */
}
/*
@@ -2371,7 +2367,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2408,10 +2404,10 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TIDStoreIter *iter;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,8 +2424,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while (tidstore_iterate_next(iter))
{
BlockNumber tblk;
Buffer buf;
@@ -2438,12 +2434,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = iter->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2467,9 +2464,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
@@ -2491,11 +2487,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2514,16 +2509,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2603,7 +2593,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3105,46 +3094,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3155,12 +3104,6 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
-
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
* be used for an index, so we invoke parallelism only if there are at
@@ -3186,7 +3129,6 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3199,11 +3141,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7ccde07de9..03ce9c3b6e 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2295,16 +2295,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TIDStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2335,18 +2335,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2357,32 +2345,9 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
+ TIDStore *dead_items = (TIDStore *) state;
- return (res != NULL);
+ return tidstore_lookup_tid(dead_items, itemptr);
}
/*
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26d796e52..641c98d80b 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TIDStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,7 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +225,22 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +288,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +355,15 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(dead_items_dsa);
+ pvs->dead_items = dead_items;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +373,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +382,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +439,8 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_free(pvs->dead_items);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +449,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TIDStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +947,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +993,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1042,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 3b06f22af5..a428046d71 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -1731,6 +1731,7 @@ rt_attach(dsa_area *dsa, dsa_pointer dp)
tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->ctl_dp = dp;
+ tree->dsa = dsa;
tree->ctl = (radix_tree_control *) dsa_get_address(dsa, dp);
/* XXX: do we need to set a callback on exit to detach dsa? */
@@ -1738,6 +1739,14 @@ rt_attach(dsa_area *dsa, dsa_pointer dp)
return tree;
}
+void
+rt_detach(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ dsa_detach(tree->dsa);
+ pfree(tree);
+}
+
/*
* Free the given radix tree.
*/
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 0fc0cf6ebb..f94608f45a 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -183,6 +183,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..40b8021f9b
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * TID storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TIDStore TIDStore;
+
+typedef struct TIDStoreIter
+{
+ TIDStore *ts;
+
+ rt_iter *tree_iter;
+
+ bool finished;
+
+ uint64 next_key;
+ uint64 next_val;
+
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually don't use up */
+ int num_offsets;
+} TIDStoreIter;
+
+extern TIDStore *tidstore_create(dsa_area *dsa);
+extern TIDStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TIDStore *ts);
+extern void tidstore_free(TIDStore *ts);
+extern void tidstore_reset(TIDStore *ts);
+extern void tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TIDStore *ts, ItemPointer tid);
+extern TIDStoreIter * tidstore_begin_iterate(TIDStore *ts);
+extern bool tidstore_iterate_next(TIDStoreIter *iter);
+extern uint64 tidstore_num_tids(TIDStore *ts);
+extern uint64 tidstore_memory_usage(TIDStore *ts);
+extern tidstore_handle tidstore_get_handle(TIDStore *ts);
+
+#endif /* TIDSTORE_H */
+
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d816ba7f4..d221528f16 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -235,21 +236,6 @@ typedef struct VacuumParams
int nworkers;
} VacuumParams;
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -306,18 +292,16 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TIDStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TIDStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d9d8355c21..e3f90adebd 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -29,6 +29,7 @@ extern rt_iter *rt_begin_iterate(radix_tree *tree);
extern dsa_pointer rt_get_dsa_pointer(radix_tree *tree);
extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+extern void rt_detach(radix_tree *tree);
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index ca4eca76f4..0999e4fc10 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -193,6 +193,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
--
2.31.1
On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
For parallel heap pruning, multiple workers will insert key-value
pairs to the radix tree concurrently. The simplest solution would be a
single lock to protect writes but the performance will not be good.
Another solution would be that we can divide the tables into multiple
ranges so that keys derived from TIDs are not conflicted with each
other and have parallel workers process one or more ranges. That way,
parallel vacuum workers can build *sub-trees* and the leader process
can merge them. In use cases of lazy vacuum, since the write phase and
read phase are separated the readers don't need to worry about
concurrent updates.
It's a good idea to use ranges for a different reason -- readahead. See
commit 56788d2156fc3, which aimed to improve readahead for sequential
scans. It might work to use that as a model: Each worker prunes a range of
64 pages, keeping the dead tids in a local array. At the end of the range:
lock the tid store, enter the tids into the store, unlock, free the local
array, and get the next range from the leader. It's possible contention
won't be too bad, and I suspect using small local arrays as-we-go would be
faster and use less memory than merging multiple sub-trees at the end.
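To sketch the shape of that worker loop (a rough illustration only -- PrunedPage, get_next_block_range(), prune_one_page(), and dead_items_lock are hypothetical placeholders here; only tidstore_add_tids() comes from the draft TIDStore patch):

typedef struct PrunedPage		/* hypothetical local bookkeeping */
{
	BlockNumber	blkno;
	int			ndead;
	OffsetNumber offsets[MaxHeapTuplesPerPage];
} PrunedPage;

while (get_next_block_range(leader, &start_blk, &end_blk))	/* e.g. 64 pages */
{
	PrunedPage *pages = palloc(sizeof(PrunedPage) * (end_blk - start_blk));
	int			npages = 0;

	/* prune the whole range, remembering dead TIDs only locally */
	for (BlockNumber blk = start_blk; blk < end_blk; blk++)
	{
		PrunedPage *pp = &pages[npages];

		pp->ndead = prune_one_page(rel, blk, pp->offsets);
		if (pp->ndead > 0)
		{
			pp->blkno = blk;
			npages++;
		}
	}

	/* at the end of the range, enter everything into the shared store */
	LWLockAcquire(&shared->dead_items_lock, LW_EXCLUSIVE);
	for (int i = 0; i < npages; i++)
		tidstore_add_tids(dead_items, pages[i].blkno,
						  pages[i].offsets, pages[i].ndead);
	LWLockRelease(&shared->dead_items_lock);

	pfree(pages);
}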
I've attached a draft patch for lazy vacuum integration that can be
applied on top of v8 patches. The patch adds a new module called
TIDStore, an efficient storage for TID backed by radix tree. Lazy
vacuum and parallel vacuum use it instead of a TID array. The patch
also introduces rt_detach() that was missed in 0002 patch. It's a very
rough patch but I hope it helps in considering lazy vacuum
integration, radix tree APIs, and shared radix tree functionality.
It does help, good to see this.
--
John Naylor
EDB: http://www.enterprisedb.com
On Sat, Nov 5, 2022 at 6:23 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
For parallel heap pruning, multiple workers will insert key-value
pairs to the radix tree concurrently. The simplest solution would be a
single lock to protect writes but the performance will not be good.
Another solution would be that we can divide the tables into multiple
ranges so that keys derived from TIDs are not conflicted with each
other and have parallel workers process one or more ranges. That way,
parallel vacuum workers can build *sub-trees* and the leader process
can merge them. In use cases of lazy vacuum, since the write phase and
read phase are separated the readers don't need to worry about
concurrent updates.
It's a good idea to use ranges for a different reason -- readahead. See commit 56788d2156fc3, which aimed to improve readahead for sequential scans. It might work to use that as a model: Each worker prunes a range of 64 pages, keeping the dead tids in a local array. At the end of the range: lock the tid store, enter the tids into the store, unlock, free the local array, and get the next range from the leader. It's possible contention won't be too bad, and I suspect using small local arrays as-we-go would be faster and use less memory than merging multiple sub-trees at the end.
Seems like a promising idea. I think it might work well even in the current
parallel vacuum (i.e., single writer). I mean, I think we can have a
single lwlock for shared cases in the first version. If the overhead
of acquiring the lwlock per insertion of key-value is not negligible,
we might want to try this idea.
Apart from that, I'm going to incorporate the comments on 0004 patch
and try a pointer tagging.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Nov 4, 2022 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
For parallel heap pruning, multiple workers will insert key-value
pairs to the radix tree concurrently. The simplest solution would be a
single lock to protect writes but the performance will not be good.
Another solution would be that we can divide the tables into multiple
ranges so that keys derived from TIDs are not conflicted with each
other and have parallel workers process one or more ranges. That way,
parallel vacuum workers can build *sub-trees* and the leader process
can merge them. In use cases of lazy vacuum, since the write phase and
read phase are separated the readers don't need to worry about
concurrent updates.
I think that the VM snapshot concept can eventually be used to
implement parallel heap pruning. Since every page that will become a
scanned_pages is known right from the start with VM snapshots, it will
be relatively straightforward to partition these pages into distinct
ranges with an equal number of pages, one per worker planned. The VM
snapshot structure can also be used for I/O prefetching, which will be
more important with parallel heap pruning (and with aio).
Working off of an immutable structure that describes which pages to
process right from the start is naturally easy to work with, in
general. We can "reorder work" flexibly (i.e. process individual
scanned_pages in any order that is convenient). Another example is
"changing our mind" about advancing relfrozenxid when it turns out
that we maybe should have decided to do that at the start of VACUUM
[1]. The VM snapshot approach may not turn out to
be a very useful idea, but it is at least an interesting and thought
provoking concept.
[1] /messages/by-id/CAH2-WzkQ86yf==mgAF=cQ0qeLRWKX3htLw9Qo+qx3zbwJJkPiQ@mail.gmail.com
--
Peter Geoghegan
On Tue, Nov 8, 2022 at 11:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Nov 5, 2022 at 6:23 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
For parallel heap pruning, multiple workers will insert key-value
pairs to the radix tree concurrently. The simplest solution would be a
single lock to protect writes but the performance will not be good.
Another solution would be that we can divide the tables into multiple
ranges so that keys derived from TIDs are not conflicted with each
other and have parallel workers process one or more ranges. That way,
parallel vacuum workers can build *sub-trees* and the leader process
can merge them. In use cases of lazy vacuum, since the write phase and
read phase are separated the readers don't need to worry about
concurrent updates.
It's a good idea to use ranges for a different reason -- readahead. See commit 56788d2156fc3, which aimed to improve readahead for sequential scans. It might work to use that as a model: Each worker prunes a range of 64 pages, keeping the dead tids in a local array. At the end of the range: lock the tid store, enter the tids into the store, unlock, free the local array, and get the next range from the leader. It's possible contention won't be too bad, and I suspect using small local arrays as-we-go would be faster and use less memory than merging multiple sub-trees at the end.
Seems a promising idea. I think it might work well even in the current
parallel vacuum (ie., single writer). I mean, I think we can have a
single lwlock for shared cases in the first version. If the overhead
of acquiring the lwlock per insertion of key-value is not negligible,
we might want to try this idea.
Apart from that, I'm going to incorporate the comments on 0004 patch
and try a pointer tagging.
I'd like to share some progress on this work.
The 0004 patch is a new patch supporting pointer tagging of the node
kind. Also, it introduces the rt_node_ptr we discussed, so that internal
functions use it rather than having two arguments for encoded and
decoded pointers. With this intermediate patch, the DSA support patch
became more readable and understandable. We could probably make it
even smaller if we move the change of separating the control object
from radix_tree into the main patch (0002). The patch still needs to be
polished but I'd like to check if this idea is worthwhile. If we agree
on this direction, this patch will be merged into the main radix tree
implementation patch.
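As a rough sketch of the tagging idea (not the actual 0004 code; the macro and helper names below are made up, and it assumes node allocations are at least 8-byte aligned so the low bits of the address are free):

#define RT_PTR_TAG_MASK	((uintptr_t) 0x07)

static inline rt_node_ptr
node_ptr_tag(rt_node *node, uint8 kind)
{
	rt_node_ptr ptr;

	ptr.decoded = node;
	ptr.encoded = ((uintptr_t) node) | kind;
	return ptr;
}

static inline uint8
node_ptr_kind(rt_node_ptr ptr)
{
	/* the kind is available without dereferencing the node */
	return (uint8) (ptr.encoded & RT_PTR_TAG_MASK);
}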
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v9-0003-tool-for-measuring-radix-tree-performance.patch (application/octet-stream)
From b5950f71e476f3621e46ec7da0f8f9f7a452a685 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v9 3/6] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 56 +++
contrib/bench_radix_tree/bench_radix_tree.c | 466 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 578 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..0874201d7e
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,56 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..7abb237e96
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,466 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* seed with a constant for reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
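
For reference, the key encoding used by tid_to_key_off() in the benchmark above
can be illustrated with a small standalone sketch. The constants and helper name
below are assumptions for illustration only: with 8kB heap pages
MaxHeapTuplesPerPage is 291, so the offset number needs 9 bits; the low 6 bits of
the combined value select a bit within the uint64 stored in the tree, and the
remaining upper bits become the radix tree key.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 9   /* ceil(log2(291)), assumed for 8kB pages */

/*
 * Pack (block, offset) into a radix tree key plus a bit position, mirroring
 * tid_to_key_off(): the offset goes in the low bits, the block number above
 * it; the low 6 bits index a bit within the 64-bit value, the rest is the key.
 */
static uint64_t
tid_to_key_off_sketch(uint32_t block, uint16_t offset, uint32_t *bit)
{
    uint64_t full = ((uint64_t) block << OFFSET_BITS) | offset;

    *bit = full & ((1 << 6) - 1);
    return full >> 6;
}

int
main(void)
{
    uint32_t bit;
    uint64_t key = tid_to_key_off_sketch(1000, 17, &bit);

    printf("key = %llu, bit = %u\n", (unsigned long long) key, bit);
    return 0;
}

One TID thus maps to one bit, and up to 64 TIDs whose combined (block, offset)
values share the same upper bits share a single key/value pair in the tree,
which keeps the representation compact for pages with many dead tuples.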
Attachment: v9-0004-PoC-tag-the-node-kind-to-rt_pointer.patch (application/octet-stream)
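
The core idea of this patch is to tag the node kind into the low bits of the
node pointer so the kind no longer needs its own field in rt_node. A minimal
standalone sketch of the encode/decode follows; the helper names are made up
for the sketch, and it assumes node allocations are at least 4-byte aligned so
the low 2 bits of an address are always zero.

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uintptr_t tagged_ptr;

#define KIND_MASK 0x03

static inline tagged_ptr
tag_encode(void *p, uint8_t kind)
{
    /* alignment guarantees the low 2 bits of the raw address are zero */
    assert(((uintptr_t) p & KIND_MASK) == 0);
    return (uintptr_t) p | (kind & KIND_MASK);
}

static inline void *
tag_decode(tagged_ptr t)
{
    return (void *) (t & ~(uintptr_t) KIND_MASK);
}

static inline uint8_t
tag_kind(tagged_ptr t)
{
    return (uint8_t) (t & KIND_MASK);
}

int
main(void)
{
    int        *node = malloc(sizeof(int));
    tagged_ptr  t = tag_encode(node, 2);    /* e.g. RT_NODE_KIND_128 */

    assert(tag_decode(t) == node && tag_kind(t) == 2);
    free(node);
    return 0;
}

Presumably this also paves the way toward keeping the tree in DSA memory for
parallel vacuum, where child links cannot be stored as raw C pointers.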
From 624fd0577546a746f0538b98ab7456adc4ca1bd5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 14 Nov 2022 11:44:17 +0900
Subject: [PATCH v9 4/6] PoC: tag the node kind to rt_pointer.
---
src/backend/lib/radixtree.c | 660 ++++++++++++++++++++----------------
1 file changed, 375 insertions(+), 285 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bd58b2bfad..c25d455d2a 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -126,6 +126,23 @@ typedef enum
#define RT_NODE_KIND_128 0x02
#define RT_NODE_KIND_256 0x03
#define RT_NODE_KIND_COUNT 4
+#define RT_POINTER_KIND_MASK 0x03
+
+/*
+ * rt_pointer is a tagged pointer for rt_node. It is encoded from a
+ * C-pointer (i.e., a local memory address) and the node kind. The node
+ * kind uses the lower 2 bits, which are always 0 in a local memory address.
+ * We can encode and decode the pointer using the rt_pointer_encode() and
+ * rt_pointer_decode() functions, respectively.
+ *
+ * The inner nodes of the radix tree need to store rt_pointer rather than a
+ * C-pointer for the above reason.
+ */
+typedef uintptr_t rt_pointer;
+#define InvalidRTPointer ((rt_pointer) 0)
+#define RTPointerIsValid(x) (((rt_pointer) (x)) != InvalidRTPointer)
+#define RTPointerTagKind(x, k) ((rt_pointer) (x) | ((k) & RT_POINTER_KIND_MASK))
+#define RTPointerUnTagKind(x) ((rt_pointer) (x) & ~RT_POINTER_KIND_MASK)
/* Common type for all nodes types */
typedef struct rt_node
@@ -144,13 +161,12 @@ typedef struct rt_node
uint8 shift;
uint8 chunk;
- /* Size kind of the node */
- uint8 kind;
+ /*
+ * The node kind is tagged into the rt_pointer, see the comments of
+ * rt_pointer for details.
+ */
} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
-#define NODE_HAS_FREE_SLOT(n) \
- (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+#define RT_NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
/* Base type of each node kinds for leaf and inner nodes */
typedef struct rt_node_base_4
@@ -205,7 +221,7 @@ typedef struct rt_node_inner_4
rt_node_base_4 base;
/* 4 children, for key chunks */
- rt_node *children[4];
+ rt_pointer children[4];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
@@ -221,7 +237,7 @@ typedef struct rt_node_inner_32
rt_node_base_32 base;
/* 32 children, for key chunks */
- rt_node *children[32];
+ rt_pointer children[32];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
@@ -237,7 +253,7 @@ typedef struct rt_node_inner_128
rt_node_base_128 base;
/* Slots for 128 children */
- rt_node *children[128];
+ rt_pointer children[128];
} rt_node_inner_128;
typedef struct rt_node_leaf_128
@@ -260,7 +276,7 @@ typedef struct rt_node_inner_256
rt_node_base_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
+ rt_pointer children[RT_NODE_MAX_SLOTS];
} rt_node_inner_256;
typedef struct rt_node_leaf_256
@@ -274,6 +290,30 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
+/*
+ * rt_node_ptr is a useful data structure representing a pointer to an rt_node.
+ */
+typedef struct rt_node_ptr
+{
+ rt_pointer encoded;
+ rt_node *decoded;
+} rt_node_ptr;
+#define InvalidRTNodePtr \
+ (rt_node_ptr) {.encoded = InvalidRTPointer, .decoded = NULL }
+#define RTNodePtrIsValid(n) \
+ (!rt_node_ptr_eq((rt_node_ptr *) &(n), &(InvalidRTNodePtr)))
+
+/* Macros for rt_node_ptr to access the fields of rt_node */
+#define NODE_RAW(n) (((rt_node_ptr) (n)).decoded)
+#define NODE_IS_LEAF(n) (NODE_RAW(n)->shift == 0)
+#define NODE_IS_EMPTY(n) (NODE_COUNT(n) == 0)
+#define NODE_KIND(n) ((uint8) (((rt_node_ptr) (n)).encoded & RT_POINTER_KIND_MASK))
+#define NODE_COUNT(n) (NODE_RAW(n)->count)
+#define NODE_SHIFT(n) (NODE_RAW(n)->shift)
+#define NODE_CHUNK(n) (NODE_RAW(n)->chunk)
+#define NODE_HAS_FREE_SLOT(n) \
+ (NODE_COUNT(n) < rt_node_kind_info[NODE_KIND(n)].fanout)
+
/* Information of each size kinds */
typedef struct rt_node_kind_info_elem
{
@@ -347,7 +387,7 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
*/
typedef struct rt_node_iter
{
- rt_node *node; /* current node being iterated */
+ rt_node_ptr node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
} rt_node_iter;
@@ -368,7 +408,7 @@ struct radix_tree
{
MemoryContext context;
- rt_node *root;
+ rt_pointer root;
uint64 max_val;
uint64 num_keys;
@@ -382,26 +422,56 @@ struct radix_tree
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
- bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
+static rt_node_ptr rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node_ptr node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+static inline bool rt_node_search_inner(rt_node_ptr node_ptr, uint64 key, rt_action action,
+ rt_pointer *child_p);
+static inline bool rt_node_search_leaf(rt_node_ptr node_ptr, uint64 key, rt_action action,
uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+static bool rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node_ptr *child_p);
static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static void rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from);
static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
+static void rt_verify_node(rt_node_ptr node);
+
+/* Decode and encode function of rt_pointer */
+static inline rt_node *
+rt_pointer_decode(rt_pointer encoded)
+{
+ return (rt_node *) RTPointerUnTagKind(encoded);
+}
+
+static inline rt_pointer
+rt_pointer_encode(rt_node *decoded, uint8 kind)
+{
+ return (rt_pointer) RTPointerTagKind(decoded, kind);
+}
+
+/* Return a rt_pointer created from the given encoded pointer */
+static inline rt_node_ptr
+rt_node_ptr_encoded(rt_pointer encoded)
+{
+ return (rt_node_ptr) {
+ .encoded = encoded,
+ .decoded = rt_pointer_decode(encoded)
+ };
+}
+
+static inline bool
+rt_node_ptr_eq(rt_node_ptr *a, rt_node_ptr *b)
+{
+ return (a->decoded == b->decoded) && (a->encoded == b->encoded);
+}
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
@@ -550,10 +620,10 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_shift(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_pointer) * (count - idx));
}
static inline void
@@ -565,10 +635,10 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_delete(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_pointer) * (count - idx - 1));
}
static inline void
@@ -580,15 +650,15 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children, int count)
+chunk_children_array_copy(uint8 *src_chunks, rt_pointer *src_children,
+ uint8 *dst_chunks, rt_pointer *dst_children, int count)
{
/* For better code generation */
if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
pg_unreachable();
memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
- memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+ memcpy(dst_children, src_children, sizeof(rt_pointer) * count);
}
static inline void
@@ -616,28 +686,28 @@ node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[slot] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[slot]);
}
static inline bool
node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
}
-static inline rt_node *
+static inline rt_pointer
node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline uint64
node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((rt_node_base_128 *) node)->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -645,7 +715,7 @@ node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
static void
node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
}
@@ -654,7 +724,7 @@ node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
}
@@ -665,7 +735,7 @@ node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
{
int slotpos = 0;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
while (node_inner_128_is_slot_used(node, slotpos))
slotpos++;
@@ -677,7 +747,7 @@ node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
/* We iterate over the isset bitmap per byte then check each bit */
for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
@@ -695,11 +765,11 @@ node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
}
static inline void
-node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_pointer child)
{
int slotpos;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
/* find unused slot */
slotpos = node_inner_128_find_unused_slot(node, chunk);
@@ -714,7 +784,7 @@ node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
/* find unused slot */
slotpos = node_leaf_128_find_unused_slot(node, chunk);
@@ -726,16 +796,16 @@ node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
/* Update the child corresponding to 'chunk' to 'child' */
static inline void
-node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[node->base.slot_idxs[chunk]] = child;
}
static inline void
node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->values[node->base.slot_idxs[chunk]] = value;
}
@@ -745,21 +815,21 @@ node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
static inline bool
node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[chunk]);
}
static inline bool
node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
}
-static inline rt_node *
+static inline rt_pointer
node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(node_inner_256_is_chunk_used(node, chunk));
return node->children[chunk];
}
@@ -767,16 +837,16 @@ node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
static inline uint64
node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(node_leaf_256_is_chunk_used(node, chunk));
return node->values[chunk];
}
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -784,7 +854,7 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
static inline void
node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
node->values[chunk] = value;
}
@@ -793,14 +863,14 @@ node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
static inline void
node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = InvalidRTPointer;
}
static inline void
node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
}
@@ -835,37 +905,36 @@ static void
rt_new_root(radix_tree *tree, uint64 key)
{
int shift = key_get_shift(key);
- rt_node *node;
+ rt_node_ptr node;
- node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
- shift > 0);
+ node = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, shift > 0);
tree->max_val = shift_get_max_val(shift);
- tree->root = node;
+ tree->root = node.encoded;
}
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
+static rt_node_ptr
rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
- rt_node_kind_info[kind].inner_size);
+ newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
else
- newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
- rt_node_kind_info[kind].leaf_size);
+ newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
- newnode->kind = kind;
- newnode->shift = shift;
- newnode->chunk = chunk;
+ newnode.encoded = rt_pointer_encode(newnode.decoded, kind);
+ NODE_SHIFT(newnode) = shift;
+ NODE_CHUNK(newnode) = chunk;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_128)
{
- rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode.decoded;
memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
}
@@ -882,55 +951,56 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node *
-rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
+static rt_node_ptr
+rt_copy_node(radix_tree *tree, rt_node_ptr node, int new_kind)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
+ rt_node *n = node.decoded;
- newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
- node->shift > 0);
- newnode->count = node->count;
+ newnode = rt_alloc_node(tree, new_kind, n->shift, n->chunk, n->shift > 0);
+ NODE_COUNT(newnode) = NODE_COUNT(node);
return newnode;
}
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
- tree->root = NULL;
+ if (tree->root == node.encoded)
+ tree->root = InvalidRTPointer;
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[node->kind]--;
- Assert(tree->cnt[node->kind] >= 0);
+ tree->cnt[NODE_KIND(node)]--;
+ Assert(tree->cnt[NODE_KIND(node)] >= 0);
#endif
- pfree(node);
+ pfree(node.decoded);
}
/*
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
+ rt_node_ptr new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ Assert(NODE_CHUNK(old_child) == NODE_CHUNK(new_child));
+ Assert(NODE_SHIFT(old_child) == NODE_SHIFT(new_child));
- if (parent == old_child)
+ if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->root = new_child.encoded;
}
else
{
bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ replaced = rt_node_insert_inner(tree, InvalidRTNodePtr, parent, key,
+ new_child);
Assert(replaced);
}
@@ -945,23 +1015,26 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ rt_node *root = rt_pointer_decode(tree->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- rt_node_inner_4 *node;
+ rt_node_ptr node;
+ rt_node_inner_4 *n4;
+
+ node = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, true);
+ n4 = (rt_node_inner_4 *) node.decoded;
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
- shift, 0, true);
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ n4->base.n.count = 1;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ root->chunk = 0;
+ tree->root = node.encoded;
shift += RT_NODE_SPAN;
}
@@ -974,18 +1047,18 @@ rt_extend(radix_tree *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
+ rt_node_ptr node)
{
- int shift = node->shift;
+ int shift = NODE_SHIFT(node);
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ rt_node_ptr newchild;
int newshift = shift - RT_NODE_SPAN;
newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
- RT_GET_KEY_CHUNK(key, node->shift),
+ RT_GET_KEY_CHUNK(key, NODE_SHIFT(node)),
newshift > 0);
rt_node_insert_inner(tree, parent, node, key, newchild);
@@ -1006,17 +1079,18 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
+ rt_pointer *child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_node *child = NULL;
+ rt_pointer child;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1034,7 +1108,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1050,7 +1124,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_128:
{
- rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
break;
@@ -1066,7 +1140,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, chunk))
break;
@@ -1083,7 +1157,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && child_p)
*child_p = child;
@@ -1099,17 +1173,17 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node_ptr node, uint64 key, rt_action action, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
uint64 value = 0;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1127,7 +1201,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1143,7 +1217,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_128:
{
- rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node.decoded;
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
break;
@@ -1159,7 +1233,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, chunk))
break;
@@ -1176,7 +1250,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && value_p)
*value_p = value;
@@ -1186,19 +1260,19 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(!NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1206,25 +1280,26 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n4->children[idx] = child;
+ n4->children[idx] = child.encoded;
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_inner_32 *new32;
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) new.decoded;
+
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children,
n4->base.n.count);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1237,14 +1312,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
+ n4->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1252,24 +1327,25 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n32->children[idx] = child;
+ n32->children[idx] = child.encoded;
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_inner_128 *new128;
/* grow node from 32 to 128 */
- new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_128);
+ new128 = (rt_node_inner_128 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
- node = (rt_node *) new128;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1281,31 +1357,33 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
+ n32->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_128:
{
- rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
int cnt = 0;
if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
{
/* found the existing chunk */
chunk_exists = true;
- node_inner_128_update(n128, chunk, child);
+ node_inner_128_update(n128, chunk, child.encoded);
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_inner_256 *new256;
/* grow node from 128 to 256 */
- new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1315,33 +1393,32 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
cnt++;
}
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
- node_inner_128_insert(n128, chunk, child);
+ node_inner_128_insert(n128, chunk, child.encoded);
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(node));
- node_inner_256_set(n256, chunk, child);
+ node_inner_256_set(n256, chunk, child.encoded);
break;
}
}
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1354,19 +1431,19 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1378,21 +1455,22 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) new.decoded;
+
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values,
n4->base.n.count);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1412,7 +1490,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1424,20 +1502,21 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_leaf_128 *new128;
/* grow node from 32 to 128 */
- new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_128);
+ new128 = (rt_node_leaf_128 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
- node = (rt_node *) new128;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1456,7 +1535,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_128:
{
- rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node.decoded;
int cnt = 0;
if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
@@ -1467,13 +1546,15 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_leaf_256 *new256;
/* grow node from 128 to 256 */
- new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_leaf_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1483,10 +1564,9 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
cnt++;
}
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1497,10 +1577,10 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(node));
node_leaf_256_set(n256, chunk, value);
break;
@@ -1509,7 +1589,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1533,7 +1613,7 @@ rt_create(MemoryContext ctx)
tree = palloc(sizeof(radix_tree));
tree->context = ctx;
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
tree->num_keys = 0;
@@ -1582,26 +1662,24 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- rt_node *node;
- rt_node *parent = tree->root;
+ rt_node_ptr node;
+ rt_node_ptr parent;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
if (key > tree->max_val)
rt_extend(tree, key);
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = tree->root;
-
/* Descend the tree until a leaf node */
+ parent = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1613,7 +1691,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1634,21 +1712,21 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
bool
rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RTPointerIsValid(tree->root) || key > tree->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1656,7 +1734,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1670,8 +1748,8 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
bool
rt_delete(radix_tree *tree, uint64 key)
{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ rt_node_ptr node;
+ rt_node_ptr stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1683,12 +1761,12 @@ rt_delete(radix_tree *tree, uint64 key)
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
{
- rt_node *child;
+ rt_pointer child;
/* Push the current node to the stack */
stack[++level] = node;
@@ -1696,7 +1774,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1745,7 +1823,7 @@ rt_delete(radix_tree *tree, uint64 key)
*/
if (level == 0)
{
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
}
@@ -1757,6 +1835,7 @@ rt_iter *
rt_begin_iterate(radix_tree *tree)
{
MemoryContext old_ctx;
+ rt_node_ptr root;
rt_iter *iter;
int top_level;
@@ -1766,17 +1845,18 @@ rt_begin_iterate(radix_tree *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree)
+ if (!RTPointerIsValid(iter->tree))
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = rt_node_ptr_encoded(iter->tree->root);
+ top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ rt_update_iter_stack(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1787,14 +1867,15 @@ rt_begin_iterate(radix_tree *tree)
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
{
int level = from;
- rt_node *node = from_node;
+ rt_node_ptr node = from_node;
for (;;)
{
rt_node_iter *node_iter = &(iter->stack[level--]);
+ bool found PG_USED_FOR_ASSERTS_ONLY;
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1804,10 +1885,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
return;
/* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
+ found = rt_node_inner_iterate_next(iter, node_iter, &node);
/* We must find the first children in the node */
- Assert(node);
+ Assert(found);
}
}
@@ -1824,7 +1905,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- rt_node *child = NULL;
+ rt_node_ptr child = InvalidRTNodePtr;
uint64 value;
int level;
bool found;
@@ -1845,14 +1926,12 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
*/
for (level = 1; level <= iter->stack_len; level++)
{
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
-
- if (child)
+ if (rt_node_inner_iterate_next(iter, &(iter->stack[level]), &child))
break;
}
/* the iteration finished */
- if (!child)
+ if (!RTNodePtrIsValid(child))
return false;
/*
@@ -1884,18 +1963,19 @@ rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
* Advance the slot in the inner node. Return the child if exists, otherwise
* null.
*/
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+static inline bool
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *child_p)
{
- rt_node *child = NULL;
+ rt_node_ptr node = node_iter->node;
+ rt_pointer child;
bool found = false;
uint8 key_chunk;
- switch (node_iter->node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -1908,7 +1988,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -1921,7 +2001,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_128:
{
- rt_node_inner_128 *n128 = (rt_node_inner_128 *) node_iter->node;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -1941,7 +2021,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -1962,9 +2042,12 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
if (found)
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ {
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
+ *child_p = rt_node_ptr_encoded(child);
+ }
- return child;
+ return found;
}
/*
@@ -1972,19 +2055,18 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
* is set to value_p, otherwise return false.
*/
static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_p)
{
- rt_node *node = node_iter->node;
+ rt_node_ptr node = node_iter->node;
bool found = false;
uint64 value;
uint8 key_chunk;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -1997,7 +2079,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2010,7 +2092,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_128:
{
- rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node_iter->node;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2030,7 +2112,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2052,7 +2134,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
if (found)
{
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
*value_p = value;
}
@@ -2089,16 +2171,16 @@ rt_memory_usage(radix_tree *tree)
* Verify the radix tree node.
*/
static void
-rt_verify_node(rt_node *node)
+rt_verify_node(rt_node_ptr node)
{
#ifdef USE_ASSERT_CHECKING
- Assert(node->count >= 0);
+ Assert(NODE_COUNT(node) >= 0);
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node.decoded;
for (int i = 1; i < n4->n.count; i++)
Assert(n4->chunks[i - 1] < n4->chunks[i]);
@@ -2107,7 +2189,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_32:
{
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node.decoded;
for (int i = 1; i < n32->n.count; i++)
Assert(n32->chunks[i - 1] < n32->chunks[i]);
@@ -2116,7 +2198,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_128:
{
- rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2126,10 +2208,10 @@ rt_verify_node(rt_node *node)
/* Check if the corresponding slot is used */
if (NODE_IS_LEAF(node))
- Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) node,
+ Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) n128,
n128->slot_idxs[i]));
else
- Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) node,
+ Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) n128,
n128->slot_idxs[i]));
cnt++;
@@ -2142,7 +2224,7 @@ rt_verify_node(rt_node *node)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
@@ -2163,9 +2245,11 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
+ rt_node_ptr root = rt_node_ptr_encoded(tree->root);
+
ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
+ NODE_SHIFT(root) / RT_NODE_SPAN,
tree->cnt[0],
tree->cnt[1],
tree->cnt[2],
@@ -2173,42 +2257,44 @@ rt_stats(radix_tree *tree)
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(rt_node_ptr node, int level, bool recurse)
{
+ rt_node *n = node.decoded;
char space[128] = {0};
fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_128) ? 128 : 256,
- node->count, node->shift, node->chunk);
+ (NODE_KIND(node) == RT_NODE_KIND_4) ? 4 :
+ (NODE_KIND(node) == RT_NODE_KIND_32) ? 32 :
+ (NODE_KIND(node) == RT_NODE_KIND_128) ? 128 : 256,
+ n->count, n->shift, n->chunk);
if (level > 0)
sprintf(space, "%*c", level * 4, ' ');
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
space, n4->base.chunks[i], n4->values[i]);
}
else
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2217,25 +2303,26 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_32:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
space, n32->base.chunks[i], n32->values[i]);
}
else
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n32->base.chunks[i]);
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2245,7 +2332,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_128:
{
- rt_node_base_128 *b128 = (rt_node_base_128 *) node;
+ rt_node_base_128 *b128 = (rt_node_base_128 *) node.decoded;
fprintf(stderr, "slot_idxs ");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2257,7 +2344,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
+ rt_node_leaf_128 *n = (rt_node_leaf_128 *) node.decoded;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < 16; i++)
@@ -2287,7 +2374,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_128_get_child(n128, i),
+ rt_dump_node(rt_node_ptr_encoded(node_inner_128_get_child(n128, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2301,7 +2388,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, i))
continue;
@@ -2311,7 +2398,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
else
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, i))
continue;
@@ -2320,8 +2407,8 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
+ rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2334,14 +2421,14 @@ rt_dump_node(rt_node *node, int level, bool recurse)
void
rt_dump_search(radix_tree *tree, uint64 key)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
int level = 0;
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
elog(NOTICE, "tree is empty");
return;
@@ -2354,11 +2441,11 @@ rt_dump_search(radix_tree *tree, uint64 key)
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
rt_dump_node(node, level, false);
@@ -2375,7 +2462,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2384,6 +2471,8 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
+ rt_node_ptr root;
+
for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
rt_node_kind_info[i].name,
@@ -2393,12 +2482,13 @@ rt_dump(radix_tree *tree)
rt_node_kind_info[i].leaf_blocksize);
fprintf(stderr, "max_val = %lu\n", tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ root = rt_node_ptr_encoded(tree->root);
+ rt_dump_node(root, 0, true);
}
#endif
--
2.31.1
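To make the encoded/decoded pointer scheme that the patch above switches to easier to follow, here is a minimal, self-contained sketch (not the patch code itself): an rt_pointer keeps the node kind in the low bits of the address, and rt_node_ptr pairs that tagged value with the plain local pointer so the caller does not need to re-decode it on every access. The mask width and the kind value used below are illustrative assumptions, not the exact constants from the patch.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uintptr_t rt_pointer;		/* tagged value, as stored in parent nodes */

#define KIND_MASK	((uintptr_t) 0x3)	/* assumed: node kind lives in the low bits */

typedef struct rt_node
{
	int			count;
} rt_node;

typedef struct rt_node_ptr
{
	rt_pointer	encoded;		/* tagged pointer, what the tree stores */
	rt_node	   *decoded;		/* plain local address, for direct access */
} rt_node_ptr;

static rt_pointer
encode(rt_node *node, uint8_t kind)
{
	/* relies on allocations being aligned so the low bits are free */
	assert(((uintptr_t) node & KIND_MASK) == 0);
	return (rt_pointer) node | kind;
}

static rt_node_ptr
decode(rt_pointer encoded)
{
	rt_node_ptr p;

	p.encoded = encoded;
	p.decoded = (rt_node *) (encoded & ~KIND_MASK);
	return p;
}

int
main(void)
{
	rt_node	   *raw = malloc(sizeof(rt_node));	/* malloc alignment keeps low bits zero */
	rt_node_ptr node;

	raw->count = 42;
	node = decode(encode(raw, 2));	/* "2" stands in for an RT_NODE_KIND_* value */
	printf("kind = %d, count = %d\n",
		   (int) (node.encoded & KIND_MASK), node.decoded->count);
	free(raw);
	return 0;
}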
Attachment: v9-0005-PoC-DSA-support-for-radix-tree.patch (application/octet-stream)
From a304e99926444dda3861722c53d9cbd86e61fec0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 27 Oct 2022 14:02:00 +0900
Subject: [PATCH v9 5/6] PoC: DSA support for radix tree.
---
.../bench_radix_tree--1.0.sql | 2 +
contrib/bench_radix_tree/bench_radix_tree.c | 12 +-
src/backend/lib/radixtree.c | 484 +++++++++++++-----
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 8 +-
src/include/utils/dsa.h | 1 +
.../expected/test_radixtree.out | 17 +
.../modules/test_radixtree/test_radixtree.c | 100 ++--
8 files changed, 482 insertions(+), 154 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 0874201d7e..cf294c01d6 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -7,6 +7,7 @@ create function bench_shuffle_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
@@ -23,6 +24,7 @@ create function bench_seq_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 7abb237e96..be3f7ed811 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -15,6 +15,7 @@
#include "lib/radixtree.h"
#include <math.h>
#include "miscadmin.h"
+#include "storage/lwlock.h"
#include "utils/timestamp.h"
PG_MODULE_MAGIC;
@@ -149,7 +150,9 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
BlockNumber minblk = PG_GETARG_INT32(0);
BlockNumber maxblk = PG_GETARG_INT32(1);
bool random_block = PG_GETARG_BOOL(2);
+ bool shared = PG_GETARG_BOOL(3);
radix_tree *rt = NULL;
+ dsa_area *dsa = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;
@@ -171,8 +174,11 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+ if (shared)
+ dsa = dsa_create(LWLockNewTrancheId());
+
/* measure the load time of the radix tree */
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, dsa);
start_time = GetCurrentTimestamp();
for (int i = 0; i < ntids; i++)
{
@@ -323,7 +329,7 @@ bench_load_random_int(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
pg_prng_seed(&state, 0);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
@@ -375,7 +381,7 @@ bench_fixed_height_search(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index c25d455d2a..fb35463b66 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -22,6 +22,15 @@
* choose it to avoid an additional pointer traversal. It is the reason this code
* currently does not support variable-length keys.
*
+ * If a DSA area is specified when calling rt_create(), the radix tree is created
+ * in that DSA area so that multiple processes can access it simultaneously. The
+ * process that created the shared radix tree needs to pass both the DSA area
+ * specified at rt_create() time and the dsa_pointer of the radix tree, fetched by
+ * rt_get_handle(), to other processes so that they can attach via rt_attach().
+ *
+ * XXX: the shared radix tree is still in a PoC state as it doesn't have any locking
+ * support. Also, it supports only single-process iteration.
+ *
* XXX: Most functions in this file have two variants for inner nodes and leaf
* nodes, therefore there are duplication codes. While this sometimes makes the
* code maintenance tricky, this reduces branch prediction misses when judging
@@ -34,6 +43,9 @@
*
* rt_create - Create a new, empty radix tree
* rt_free - Free the radix tree
+ * rt_attach - Attach to the radix tree
+ * rt_detach - Detach from the radix tree
+ * rt_get_handle - Return the handle of the radix tree
* rt_search - Search a key-value pair
* rt_set - Set a key-value pair
* rt_delete - Delete a key-value pair
@@ -64,6 +76,7 @@
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
/* The number of bits encoded in one tree level */
@@ -384,6 +397,11 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: Currently only one process is allowed to iterate. Therefore, rt_node_iter
+ * holds local pointers to nodes rather than rt_node_ptr.
+ * We need either a safeguard that prevents other processes from starting an
+ * iteration while one is in progress, or support for concurrent iteration.
*/
typedef struct rt_node_iter
{
@@ -403,23 +421,43 @@ struct rt_iter
uint64 key;
};
-/* A radix tree with nodes */
-struct radix_tree
+/* A magic value used to identify our radix tree */
+#define RADIXTREE_MAGIC 0x54A48167
+
+/* Control information for a radix tree */
+typedef struct radix_tree_control
{
- MemoryContext context;
+ rt_handle handle;
+ uint32 magic;
+ /* Root node */
rt_pointer root;
- uint64 max_val;
- uint64 num_keys;
- MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
- MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+ pg_atomic_uint64 max_val;
+ pg_atomic_uint64 num_keys;
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_NODE_KIND_COUNT];
#endif
+} radix_tree_control;
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ /* control object in either backend-local memory or DSA */
+ radix_tree_control *ctl;
+
+ /* used only when the radix tree is shared */
+ dsa_area *area;
+
+ /* used only when the radix tree is private */
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
};
+#define RadixTreeIsShared(rt) ((rt)->area != NULL)
static void rt_new_root(radix_tree *tree, uint64 key);
static rt_node_ptr rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
@@ -446,24 +484,31 @@ static void rt_verify_node(rt_node_ptr node);
/* Decode and encode function of rt_pointer */
static inline rt_node *
-rt_pointer_decode(rt_pointer encoded)
+rt_pointer_decode(radix_tree *tree, rt_pointer encoded)
{
- return (rt_node *) RTPointerUnTagKind(encoded);
+ encoded = RTPointerUnTagKind(encoded);
+
+ if (RadixTreeIsShared(tree))
+ return (rt_node *) dsa_get_address(tree->area, encoded);
+ else
+ return (rt_node *) encoded;
}
static inline rt_pointer
-rt_pointer_encode(rt_node *decoded, uint8 kind)
+rt_pointer_encode(rt_pointer decoded, uint8 kind)
{
+ Assert((decoded & RT_POINTER_KIND_MASK) == 0);
+
return (rt_pointer) RTPointerTagKind(decoded, kind);
}
/* Return a rt_pointer created from the given encoded pointer */
static inline rt_node_ptr
-rt_node_ptr_encoded(rt_pointer encoded)
+rt_node_ptr_encoded(radix_tree *tree, rt_pointer encoded)
{
return (rt_node_ptr) {
.encoded = encoded,
- .decoded = rt_pointer_decode(encoded)
+ .decoded = rt_pointer_decode(tree, encoded)
};
}
@@ -908,8 +953,8 @@ rt_new_root(radix_tree *tree, uint64 key)
rt_node_ptr node;
node = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, shift > 0);
- tree->max_val = shift_get_max_val(shift);
- tree->root = node.encoded;
+ pg_atomic_write_u64(&tree->ctl->max_val, shift_get_max_val(shift));
+ tree->ctl->root = node.encoded;
}
/*
@@ -918,16 +963,35 @@ rt_new_root(radix_tree *tree, uint64 key)
static rt_node_ptr
rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
{
- rt_node_ptr newnode;
+ rt_node_ptr newnode;
+
+ if (tree->area != NULL)
+ {
+ dsa_pointer dp;
+
+ if (inner)
+ dp = dsa_allocate0(tree->area, rt_node_kind_info[kind].inner_size);
+ else
+ dp = dsa_allocate0(tree->area, rt_node_kind_info[kind].leaf_size);
- if (inner)
- newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
- rt_node_kind_info[kind].inner_size);
+ newnode.encoded = rt_pointer_encode((rt_pointer) dp, kind);
+ newnode.decoded = (rt_node *) dsa_get_address(tree->area, dp);
+ }
else
- newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
- rt_node_kind_info[kind].leaf_size);
+ {
+ rt_node *new;
+
+ if (inner)
+ new = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ new = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+ newnode.encoded = rt_pointer_encode((rt_pointer) new, kind);
+ newnode.decoded = new;
+ }
- newnode.encoded = rt_pointer_encode(newnode.decoded, kind);
NODE_SHIFT(newnode) = shift;
NODE_CHUNK(newnode) = chunk;
@@ -941,7 +1005,7 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[kind]++;
+ tree->ctl->cnt[kind]++;
#endif
return newnode;
@@ -968,16 +1032,19 @@ static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node.encoded)
- tree->root = InvalidRTPointer;
+ if (tree->ctl->root == node.encoded)
+ tree->ctl->root = InvalidRTPointer;
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[NODE_KIND(node)]--;
- Assert(tree->cnt[NODE_KIND(node)] >= 0);
+ tree->ctl->cnt[NODE_KIND(node)]--;
+ Assert(tree->ctl->cnt[NODE_KIND(node)] >= 0);
#endif
- pfree(node.decoded);
+ if (RadixTreeIsShared(tree))
+ dsa_free(tree->area, (dsa_pointer) RTPointerUnTagKind(node.encoded));
+ else
+ pfree(node.decoded);
}
/*
@@ -993,7 +1060,7 @@ rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child.encoded;
+ tree->ctl->root = new_child.encoded;
}
else
{
@@ -1015,7 +1082,7 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
@@ -1031,15 +1098,15 @@ rt_extend(radix_tree *tree, uint64 key)
n4->base.n.count = 1;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
root->chunk = 0;
- tree->root = node.encoded;
+ tree->ctl->root = node.encoded;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ pg_atomic_write_u64(&tree->ctl->max_val, shift_get_max_val(target_shift));
}
/*
@@ -1068,7 +1135,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
}
rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
+ pg_atomic_add_fetch_u64(&tree->ctl->num_keys, 1);
}
/*
@@ -1079,8 +1146,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
- rt_pointer *child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action, rt_pointer *child_p)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
@@ -1115,6 +1181,7 @@ rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
break;
found = true;
+
if (action == RT_ACTION_FIND)
child = n32->children[idx];
else /* RT_ACTION_DELETE */
@@ -1604,33 +1671,50 @@ rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
* Create the radix tree in the given memory context and return it.
*/
radix_tree *
-rt_create(MemoryContext ctx)
+rt_create(MemoryContext ctx, dsa_area *area)
{
radix_tree *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->context = ctx;
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+
+ tree->area = area;
+ dp = dsa_allocate0(area, sizeof(radix_tree_control));
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
+ tree->ctl->handle = (rt_handle) dp;
+ }
+ else
+ {
+ tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
+ tree->ctl->handle = InvalidDsaPointer;
+ }
+
+ tree->ctl->magic = RADIXTREE_MAGIC;
+ tree->ctl->root = InvalidRTPointer;
+ pg_atomic_init_u64(&tree->ctl->max_val, 0);
+ pg_atomic_init_u64(&tree->ctl->num_keys, 0);
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ if (area == NULL)
{
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].leaf_blocksize,
- rt_node_kind_info[i].leaf_size);
-#ifdef RT_DEBUG
- tree->cnt[i] = 0;
-#endif
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+ }
}
MemoryContextSwitchTo(old_ctx);
@@ -1638,16 +1722,159 @@ rt_create(MemoryContext ctx)
return tree;
}
+/*
+ * Get a handle that can be used by other processes to attach to this radix
+ * tree.
+ */
+dsa_pointer
+rt_get_handle(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree->ctl->handle;
+}
+
+/*
+ * Attach to an existing radix tree using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+radix_tree *
+rt_attach(dsa_area *area, rt_handle handle)
+{
+ radix_tree *tree;
+ dsa_pointer control;
+
+ /* Allocate the backend-local object representing the radix tree */
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the local radix tree */
+ tree->area = area;
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, control);
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree;
+}
+
+/*
+ * Detach from a radix tree. This frees backend-local resources associated
+ * with the radix tree, but the radix tree will continue to exist until
+ * it is explicitly freed.
+ */
+void
+rt_detach(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ pfree(tree);
+}
+
+/*
+ * Recursively free all nodes allocated to the dsa area.
+ */
+static void
+rt_free_recurse(radix_tree *tree, rt_pointer ptr)
+{
+ rt_node_ptr node = rt_node_ptr_encoded(tree, ptr);
+
+ Assert(RadixTreeIsShared(tree));
+
+ /* The leaf node doesn't have child pointers, so free it */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->area, RTPointerUnTagKind(node.encoded));
+ return;
+ }
+
+ switch (NODE_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_128_get_child(n128, i));
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_256_get_child(n256, i));
+ }
+ break;
+ }
+ }
+
+ /* Free the inner node itself */
+ dsa_free(tree->area, RTPointerUnTagKind(node.encoded));
+}
+
/*
* Free the given radix tree.
*/
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
{
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
+ /* Free all memory used for radix tree nodes */
+ rt_free_recurse(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->area, tree->ctl->handle);
+ }
+ else
+ {
+ /* Free all memory used for radix tree nodes */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+ pfree(tree->ctl);
}
pfree(tree);
@@ -1665,17 +1892,19 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
rt_node_ptr node;
rt_node_ptr parent;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree, create the root */
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > pg_atomic_read_u64(&tree->ctl->max_val))
rt_extend(tree, key);
/* Descend the tree until a leaf node */
- parent = rt_node_ptr_encoded(tree->root);
- node = rt_node_ptr_encoded(tree->root);
+ parent = rt_node_ptr_encoded(tree, tree->ctl->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
@@ -1691,7 +1920,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1699,7 +1928,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ pg_atomic_add_fetch_u64(&tree->ctl->num_keys, 1);
return updated;
}
@@ -1715,12 +1944,14 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
rt_node_ptr node;
int shift;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
Assert(value_p != NULL);
- if (!RTPointerIsValid(tree->root) || key > tree->max_val)
+ if (!RTPointerIsValid(tree->ctl->root) ||
+ key > pg_atomic_read_u64(&tree->ctl->max_val))
return false;
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
@@ -1734,7 +1965,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1754,14 +1985,17 @@ rt_delete(radix_tree *tree, uint64 key)
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (!RTPointerIsValid(tree->ctl->root) ||
+ key > pg_atomic_read_u64(&tree->ctl->max_val))
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
@@ -1774,7 +2008,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1789,7 +2023,7 @@ rt_delete(radix_tree *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ pg_atomic_sub_fetch_u64(&tree->ctl->num_keys, 1);
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1823,8 +2057,8 @@ rt_delete(radix_tree *tree, uint64 key)
*/
if (level == 0)
{
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
+ tree->ctl->root = InvalidRTPointer;
+ pg_atomic_write_u64(&tree->ctl->max_val, 0);
}
return true;
@@ -1839,6 +2073,8 @@ rt_begin_iterate(radix_tree *tree)
rt_iter *iter;
int top_level;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
old_ctx = MemoryContextSwitchTo(tree->context);
iter = (rt_iter *) palloc0(sizeof(rt_iter));
@@ -1848,7 +2084,7 @@ rt_begin_iterate(radix_tree *tree)
if (!RTPointerIsValid(iter->tree))
return iter;
- root = rt_node_ptr_encoded(iter->tree->root);
+ root = rt_node_ptr_encoded(tree, iter->tree->ctl->root);
top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
@@ -1899,6 +2135,8 @@ rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
bool
rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
{
+ Assert(!RadixTreeIsShared(iter->tree) || iter->tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree */
if (!iter->tree)
return false;
@@ -2044,7 +2282,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *
if (found)
{
rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
- *child_p = rt_node_ptr_encoded(child);
+ *child_p = rt_node_ptr_encoded(iter->tree, child);
}
return found;
@@ -2147,7 +2385,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_
uint64
rt_num_entries(radix_tree *tree)
{
- return tree->num_keys;
+ return pg_atomic_read_u64(&tree->ctl->num_keys);
}
/*
@@ -2156,12 +2394,19 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
+ total = dsa_get_total_size(tree->area);
+ else
{
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
}
return total;
@@ -2245,19 +2490,19 @@ rt_verify_node(rt_node_ptr node)
void
rt_stats(radix_tree *tree)
{
- rt_node_ptr root = rt_node_ptr_encoded(tree->root);
+ rt_node_ptr root = rt_node_ptr_encoded(tree, tree->ctl->root);
ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
- tree->num_keys,
+ pg_atomic_read_u64(&tree->ctl->num_keys),
NODE_SHIFT(root) / RT_NODE_SPAN,
- tree->cnt[0],
- tree->cnt[1],
- tree->cnt[2],
- tree->cnt[3])));
+ tree->ctl->cnt[0],
+ tree->ctl->cnt[1],
+ tree->ctl->cnt[2],
+ tree->ctl->cnt[3])));
}
static void
-rt_dump_node(rt_node_ptr node, int level, bool recurse)
+rt_dump_node(radix_tree *tree, rt_node_ptr node, int level, bool recurse)
{
rt_node *n = node.decoded;
char space[128] = {0};
@@ -2293,7 +2538,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n4->children[i]),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2321,7 +2566,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n32->children[i]),
level + 1, recurse);
}
else
@@ -2374,7 +2619,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_128_get_child(n128, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_128_get_child(n128, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2407,7 +2654,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_256_get_child(n256, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2418,6 +2667,27 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
}
}
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = %lu\n", pg_atomic_read_u64(&tree->ctl->max_val));
+
+ if (!RTPointerIsValid(tree->ctl->root))
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, tree->ctl->root), 0, true);
+}
+
void
rt_dump_search(radix_tree *tree, uint64 key)
{
@@ -2426,28 +2696,30 @@ rt_dump_search(radix_tree *tree, uint64 key)
int level = 0;
elog(NOTICE, "-----------------------------------------------------------");
- elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+ elog(NOTICE, "max_val = %lu (0x%lX)",
+ pg_atomic_read_u64(&tree->ctl->max_val),
+ pg_atomic_read_u64(&tree->ctl->max_val));
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > pg_atomic_read_u64(&tree->ctl->max_val))
{
elog(NOTICE, "key %lu (0x%lX) is larger than max val",
key, key);
return;
}
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
rt_pointer child;
- rt_dump_node(node, level, false);
+ rt_dump_node(tree, node, level, false);
if (NODE_IS_LEAF(node))
{
@@ -2462,33 +2734,9 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
level++;
}
}
-
-void
-rt_dump(radix_tree *tree)
-{
- rt_node_ptr root;
-
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
- fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_size,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].leaf_size,
- rt_node_kind_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = %lu\n", tree->max_val);
-
- if (!RTPointerIsValid(tree->root))
- {
- fprintf(stderr, "empty tree\n");
- return;
- }
-
- root = rt_node_ptr_encoded(tree->root);
- rt_dump_node(root, 0, true);
-}
#endif
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 82376fde2d..ad169882af 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..68a11df970 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -14,18 +14,24 @@
#define RADIXTREE_H
#include "postgres.h"
+#include "utils/dsa.h"
#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
+typedef dsa_pointer rt_handle;
-extern radix_tree *rt_create(MemoryContext ctx);
+extern radix_tree *rt_create(MemoryContext ctx, dsa_area *dsa);
extern void rt_free(radix_tree *tree);
extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern rt_handle rt_get_handle(radix_tree *tree);
+extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+extern void rt_detach(radix_tree *tree);
+
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
extern bool rt_delete(radix_tree *tree, uint64 key);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 405606fe2f..dad06adecc 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index cc6970c87c..a0ff1e1c77 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -5,21 +5,38 @@ CREATE EXTENSION test_radixtree;
--
SELECT test_radixtree();
NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "16"
NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
NOTICE: testing radix tree with pattern "all ones"
NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
NOTICE: testing radix tree with pattern "single values, distance > 2^32"
NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
test_radixtree
----------------
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index cb3596755d..a948cba4ec 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -19,6 +19,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -111,7 +112,7 @@ test_empty(void)
radix_tree *radixtree;
uint64 dummy;
- radixtree = rt_create(CurrentMemoryContext);
+ radixtree = rt_create(CurrentMemoryContext, NULL);
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -217,14 +218,10 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
* level.
*/
static void
-test_node_types(uint8 shift)
+do_test_node_types(radix_tree *radixtree, uint8 shift)
{
- radix_tree *radixtree;
-
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
- radixtree = rt_create(CurrentMemoryContext);
-
/*
* Insert and search entries for every node type at the 'shift' level,
* then delete all entries to make it empty, and insert and search entries
@@ -233,19 +230,39 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift);
test_node_types_delete(radixtree, shift);
test_node_types_insert(radixtree, shift);
+}
- rt_free(radixtree);
+static void
+test_node_types(void)
+{
+ int tranche_id = LWLockNewTrancheId();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ {
+ radix_tree *tree;
+ dsa_area *dsa;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ dsa = dsa_create(tranche_id);
+ tree = rt_create(CurrentMemoryContext, dsa);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+ dsa_detach(dsa);
+ }
}
/*
* Test with a repeating pattern, defined by the 'spec'.
*/
static void
-test_pattern(const test_spec * spec)
+do_test_pattern(radix_tree *radixtree, const test_spec * spec)
{
- radix_tree *radixtree;
rt_iter *iter;
- MemoryContext radixtree_ctx;
TimestampTz starttime;
TimestampTz endtime;
uint64 n;
@@ -271,18 +288,6 @@ test_pattern(const test_spec * spec)
pattern_values[pattern_num_values++] = i;
}
- /*
- * Allocate the radix tree.
- *
- * Allocate it in a separate memory context, so that we can print its
- * memory usage easily.
- */
- radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
- "radixtree test",
- ALLOCSET_SMALL_SIZES);
- MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
- radixtree = rt_create(radixtree_ctx);
-
/*
* Add values to the set.
*/
@@ -336,8 +341,6 @@ test_pattern(const test_spec * spec)
mem_usage = rt_memory_usage(radixtree);
fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
mem_usage, (double) mem_usage / spec->num_values);
-
- MemoryContextStats(radixtree_ctx);
}
/* Check that rt_num_entries works */
@@ -484,21 +487,54 @@ test_pattern(const test_spec * spec)
if ((nbefore - ndeleted) != nafter)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+}
+
+static void
+test_patterns(void)
+{
+ int tranche_id = LWLockNewTrancheId();
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ {
+ radix_tree *tree;
+ MemoryContext radixtree_ctx;
+ dsa_area *dsa;
+ const test_spec *spec = &test_specs[i];
- MemoryContextDelete(radixtree_ctx);
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+ /* Test the local radix tree */
+ tree = rt_create(radixtree_ctx, NULL);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextReset(radixtree_ctx);
+
+ /* Test the shared radix tree */
+ dsa = dsa_create(tranche_id);
+ tree = rt_create(radixtree_ctx, dsa);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ dsa_detach(dsa);
+ MemoryContextDelete(radixtree_ctx);
+ }
}
Datum
test_radixtree(PG_FUNCTION_ARGS)
{
test_empty();
-
- for (int shift = 0; shift <= (64 - 8); shift += 8)
- test_node_types(shift);
-
- /* Test different test patterns, with lots of entries */
- for (int i = 0; i < lengthof(test_specs); i++)
- test_pattern(&test_specs[i]);
+ test_node_types();
+ test_patterns();
PG_RETURN_VOID();
}
--
2.31.1
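For clarity, here is a rough sketch of how the shared radix tree in the above patch is meant to be used, following the API added to radixtree.h (rt_create() with a dsa_area, rt_get_handle(), rt_attach(), rt_detach(), rt_free()). It is only an illustration: it runs the "creator" and "attacher" steps in one function, omits locking (which the patch does not provide yet) and error handling, and assumes a backend environment with the patch applied.

#include "postgres.h"

#include "lib/radixtree.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"
#include "utils/memutils.h"

static void
shared_radix_tree_example(void)
{
	dsa_area   *area = dsa_create(LWLockNewTrancheId());
	radix_tree *tree = rt_create(CurrentMemoryContext, area);
	rt_handle	handle = rt_get_handle(tree);	/* to be passed to other backends */
	radix_tree *attached;
	uint64		val;

	rt_set(tree, 42, 100);

	/* In another backend: attach using the same DSA area and the handle */
	attached = rt_attach(area, handle);
	if (rt_search(attached, 42, &val))
		elog(LOG, "found value " UINT64_FORMAT, val);
	rt_detach(attached);

	/* The creating backend eventually frees the whole shared tree */
	rt_free(tree);
	dsa_detach(area);
}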
Attachment: v9-0006-PoC-lazy-vacuum-integration.patch (application/octet-stream)
From 2cbeff1f0c195eefc1daa2400361007e112e7aac Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 4 Nov 2022 14:14:42 +0900
Subject: [PATCH v9 6/6] PoC: lazy vacuum integration.
The patch includes:
* Introducing a new module called TIDStore
* Lazy vacuum and parallel vacuum integration.
TODOs:
* radix tree needs to have the reset functionality.
* should not allow TIDStore to grow beyond the memory limit.
* change the progress statistics of pg_stat_progress_vacuum.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 280 ++++++++++++++++++++++++++
src/backend/access/heap/vacuumlazy.c | 160 +++++----------
src/backend/commands/vacuum.c | 76 +------
src/backend/commands/vacuumparallel.c | 60 +++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 55 +++++
src/include/commands/vacuum.h | 24 +--
src/include/storage/lwlock.h | 1 +
10 files changed, 434 insertions(+), 226 deletions(-)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index 857beaa32d..76265974b1 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..50ec800fd6
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * TID (ItemPointer) storage implementation.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* XXX: should be configurable for non-heap AMs */
+#define TIDSTORE_OFFSET_NBITS 11 /* pg_ceil_log2_32(MaxHeapTuplesPerPage) */
+
+#define TIDSTORE_VALUE_NBITS 6 /* log(sizeof(uint64) * BITS_PER_BYTE, 2) */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+struct TIDStore
+{
+ /* main storage for TID */
+ radix_tree *tree;
+
+ /* # of tids in TIDStore */
+ int num_tids;
+
+ /* DSA area and handle for shared TIDStore */
+ rt_handle handle;
+ dsa_area *area;
+};
+
+static void tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TIDStore. The returned object is allocated in backend-local memory.
+ * The radix tree used for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TIDStore *
+tidstore_create(dsa_area *area)
+{
+ TIDStore *ts;
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+ ts->area = area;
+
+ if (area != NULL)
+ ts->handle = rt_get_handle(ts->tree);
+
+ return ts;
+}
+
+/* Attach to the shared TIDStore using a handle */
+TIDStore *
+tidstore_attach(dsa_area *area, rt_handle handle)
+{
+ TIDStore *ts;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ ts = palloc0(sizeof(TIDStore));
+ ts->tree = rt_attach(area, handle);
+
+ return ts;
+}
+
+/*
+ * Detach from a TIDStore. This detaches from the radix tree and frees the
+ * backend-local resources.
+ */
+void
+tidstore_detach(TIDStore *ts)
+{
+ rt_detach(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_free(TIDStore *ts)
+{
+ rt_free(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_reset(TIDStore *ts)
+{
+ dsa_area *area = ts->area;
+
+ /* Reset the statistics */
+ ts->num_tids = 0;
+
+ /* Recreate radix tree storage */
+ rt_free(ts->tree);
+ ts->tree = rt_create(CurrentMemoryContext, area);
+}
+
+/* Add TIDs to TIDStore */
+void
+tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ ts->num_tids++;
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+}
+
+/* Return true if the given TID is present in TIDStore */
+bool
+tidstore_lookup_tid(TIDStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = rt_search(ts->tree, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+TIDStoreIter *
+tidstore_begin_iterate(TIDStore *ts)
+{
+ TIDStoreIter *iter;
+
+ iter = palloc0(sizeof(TIDStoreIter));
+ iter->ts = ts;
+ iter->tree_iter = rt_begin_iterate(ts->tree);
+ iter->blkno = InvalidBlockNumber;
+
+ return iter;
+}
+
+bool
+tidstore_iterate_next(TIDStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+
+ if (iter->finished)
+ return false;
+
+ if (BlockNumberIsValid(iter->blkno))
+ {
+ iter->num_offsets = 0;
+ tidstore_iter_collect_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (rt_iterate_next(iter->tree_iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(iter->blkno) && iter->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return true;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_collect_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return true;
+}
+
+uint64
+tidstore_num_tids(TIDStore *ts)
+{
+ return ts->num_tids;
+}
+
+uint64
+tidstore_memory_usage(TIDStore *ts)
+{
+ return (uint64) sizeof(TIDStore) + rt_memory_usage(ts->tree);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TIDStore
+ */
+tidstore_handle
+tidstore_get_handle(TIDStore *ts)
+{
+ return rt_get_handle(ts->tree);
+}
+
+/* Extract TIDs from key-value pair */
+static void
+tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val)
+{
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ iter->offsets[iter->num_offsets++] = off;
+ }
+
+ iter->blkno = KEY_GET_BLKNO(key);
+}
+
+/* Encode a TID to key and val */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index dfbe37472f..5b013bc3a8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -144,6 +145,8 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+ int max_bytes;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -194,7 +197,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TIDStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -265,8 +268,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -397,6 +401,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indname = NULL;
vacrel->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
vacrel->verbose = verbose;
+ vacrel->max_bytes = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
errcallback.callback = vacuum_error_callback;
errcallback.arg = vacrel;
errcallback.previous = error_context_stack;
@@ -858,7 +865,7 @@ lazy_scan_heap(LVRelState *vacrel)
next_unskippable_block,
next_failsafe_block = 0,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TIDStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
@@ -872,7 +879,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = vacrel->max_bytes; /* XXX: should use # of tids */
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -942,8 +949,8 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ /* XXX: should not allow tidstore to grow beyond max_bytes */
+ if (tidstore_memory_usage(vacrel->dead_items) > vacrel->max_bytes)
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1075,11 +1082,17 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TIDStoreIter *iter;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ pfree(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1116,7 +1129,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1269,7 +1282,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1903,25 +1916,16 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
Assert(!prunestate->all_visible);
Assert(prunestate->has_lpdead_items);
vacrel->lpdead_item_pages++;
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
}
/* Finally, add page-local counts to whole-VACUUM counts */
@@ -2128,8 +2132,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2138,17 +2141,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2197,7 +2193,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2226,7 +2222,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2253,8 +2249,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2299,7 +2295,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ /* tidstore_reset(vacrel->dead_items); */
}
/*
@@ -2371,7 +2367,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2408,10 +2404,10 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TIDStoreIter *iter;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,8 +2424,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while (tidstore_iterate_next(iter))
{
BlockNumber tblk;
Buffer buf;
@@ -2438,12 +2434,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = iter->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2467,9 +2464,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
@@ -2491,11 +2487,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2514,16 +2509,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2603,7 +2593,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3105,46 +3094,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3155,12 +3104,6 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
-
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
* be used for an index, so we invoke parallelism only if there are at
@@ -3186,7 +3129,6 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3199,11 +3141,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3c8ea21475..effb72cdd6 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2295,16 +2294,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TIDStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2335,18 +2334,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2357,60 +2344,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TIDStore *dead_items = (TIDStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26d796e52..08892c2196 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TIDStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,7 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +225,22 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +288,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +355,15 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(dead_items_dsa);
+ pvs->dead_items = dead_items;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +373,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +382,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +439,8 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_free(pvs->dead_items);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +449,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TIDStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +947,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +993,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1042,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
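To make the DSA plumbing above easier to follow, the leader/worker handshake around the shared TIDStore boils down to the following condensed sketch (shm_toc size estimation, error handling, and the surrounding setup are omitted; variable names follow the patch):

    /* leader, in parallel_vacuum_init() */
    area_space = shm_toc_allocate(pcxt->toc, dsa_minimum_size());
    shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
    dead_items_dsa = dsa_create_in_place(area_space, dsa_minimum_size(),
                                         LWTRANCHE_PARALLEL_VACUUM_DSA, pcxt->seg);
    dead_items = tidstore_create(dead_items_dsa);
    shared->dead_items_handle = tidstore_get_handle(dead_items);

    /* worker, in parallel_vacuum_main() */
    area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
    dead_items_area = dsa_attach_in_place(area_space, seg);
    dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
    /* ... index bulk-deletion probes dead_items via tidstore_lookup_tid() ... */
    tidstore_detach(dead_items);
    dsa_detach(dead_items_area);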
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 532cd67f4e..d49a052b14 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -183,6 +183,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..40b8021f9b
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * TID storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TIDStore TIDStore;
+
+typedef struct TIDStoreIter
+{
+ TIDStore *ts;
+
+ rt_iter *tree_iter;
+
+ bool finished;
+
+ uint64 next_key;
+ uint64 next_val;
+
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually not fully used */
+ int num_offsets;
+} TIDStoreIter;
+
+extern TIDStore *tidstore_create(dsa_area *dsa);
+extern TIDStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TIDStore *ts);
+extern void tidstore_free(TIDStore *ts);
+extern void tidstore_reset(TIDStore *ts);
+extern void tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TIDStore *ts, ItemPointer tid);
+extern TIDStoreIter * tidstore_begin_iterate(TIDStore *ts);
+extern bool tidstore_iterate_next(TIDStoreIter *iter);
+extern uint64 tidstore_num_tids(TIDStore *ts);
+extern uint64 tidstore_memory_usage(TIDStore *ts);
+extern tidstore_handle tidstore_get_handle(TIDStore *ts);
+
+#endif /* TIDSTORE_H */
+
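To make the API above easier to review, here is a minimal usage sketch of a backend-local TIDStore (passing NULL for the dsa_area, as the serial vacuum path does); the block number and offsets below are made up for illustration:

    TIDStore     *ts = tidstore_create(NULL);
    OffsetNumber  offsets[] = {1, 2, 5};
    ItemPointerData tid;
    TIDStoreIter *iter;

    /* record dead item offsets for heap block 10 */
    tidstore_add_tids(ts, (BlockNumber) 10, offsets, lengthof(offsets));

    /* existence check, as vac_tid_reaped() now does */
    ItemPointerSet(&tid, 10, 2);
    if (tidstore_lookup_tid(ts, &tid))
        elog(NOTICE, "(10,2) is recorded as dead");

    /* iterate block by block, as lazy_vacuum_heap_rel() does */
    iter = tidstore_begin_iterate(ts);
    while (tidstore_iterate_next(iter))
    {
        /* use iter->blkno and iter->offsets[0 .. iter->num_offsets - 1] */
    }
    pfree(iter);

    tidstore_reset(ts);    /* forget all stored TIDs but keep the store */
    tidstore_free(ts);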
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d816ba7f4..d221528f16 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -235,21 +236,6 @@ typedef struct VacuumParams
int nworkers;
} VacuumParams;
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -306,18 +292,16 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TIDStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TIDStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index ca4eca76f4..0999e4fc10 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -193,6 +193,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
--
2.31.1
Attachment: v9-0002-Add-radix-implementation.patch
From ac437b4d40cd0e61258fb411e659ddd87de08a1e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v9 2/6] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2404 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 28 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 504 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3069 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..bd58b2bfad
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2404 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes: a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes (shift > 0)
+ * store pointers to their child nodes as values, whereas leaf nodes
+ * (shift == 0) store the 64-bit unsigned integers specified by the user as
+ * values. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is also the reason this
+ * code currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants, one for inner nodes and
+ * one for leaf nodes, so there is some code duplication. While this sometimes
+ * makes code maintenance tricky, it reduces branch prediction misses when
+ * judging whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context, along
+ * with memory contexts for each kind of radix tree node under it.
+ *
+ * rt_iterate_next() returns key-value pairs in ascending order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
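/*
 * A minimal usage sketch of the interface summarized above. The prototypes
 * are declared in radixtree.h; their exact signatures are assumed here rather
 * than quoted:
 *
 *    radix_tree *tree = rt_create(CurrentMemoryContext);
 *    uint64      value;
 *
 *    rt_set(tree, UINT64CONST(42), UINT64CONST(4200));
 *    if (rt_search(tree, UINT64CONST(42), &value))
 *        Assert(value == 4200);
 *    rt_delete(tree, UINT64CONST(42));
 *    rt_free(tree);
 */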
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes in the is-set bitmap needed to track nslots
+ * slots, used by node kinds that keep such a bitmap.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-128 */
+#define RT_NODE_128_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Map a slot (or chunk) number to its byte and bit position in the is-set bitmap.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search_inner() and rt_node_search_leaf() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation,
+ * a smaller class should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 40/40 -> 296/286 -> 1288/1304 -> 2056/2088 bytes for inner nodes and
+ * leaf nodes, respectively, leading to a large amount of allocator padding
+ * with aset.c. Hence the use of slab.
+ *
+ * XXX: do we need node-1 as long as there is no path compression optimization?
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_128 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/* Common header for all node kinds */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to represent up to 256
+ * children, the maximum fanout with an 8-bit span.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size kind of the node */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+
+/* Base types for each node kind, shared by inner and leaf nodes */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-128 uses the slot_idxs array, with RT_NODE_MAX_SLOTS (typically 256)
+ * entries, to store indexes into a second array that contains up to 128 values
+ * (or child pointers in inner nodes).
+ */
+typedef struct rt_node_base128
+{
+ rt_node n;
+
+ /* Index into the slots array for each possible chunk */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_128;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * There are separate from inner node size classes for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* 4 children, for key chunks */
+ rt_node *children[4];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* 4 values, for key chunks */
+ uint64 values[4];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* 32 children, for key chunks */
+ rt_node *children[32];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* 32 values, for key chunks */
+ uint64 values[32];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 children */
+ rt_node *children[128];
+} rt_node_inner_128;
+
+typedef struct rt_node_leaf_128
+{
+ rt_node_base_128 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* Slots for 128 values */
+ uint64 values[128];
+} rt_node_leaf_128;
+
+/*
+ * node-256 is the largest node type. This node has an array of
+ * RT_NODE_MAX_SLOTS length for directly storing values (or child pointers in
+ * inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information about each node kind */
+typedef struct rt_node_kind_info_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_node_kind_info_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
+
+ [RT_NODE_KIND_4] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4),
+ .leaf_size = sizeof(rt_node_leaf_4),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
+ },
+ [RT_NODE_KIND_32] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32),
+ .leaf_size = sizeof(rt_node_leaf_32),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
+ },
+ [RT_NODE_KIND_128] = {
+ .name = "radix tree node 128",
+ .fanout = 128,
+ .inner_size = sizeof(rt_node_inner_128),
+ .leaf_size = sizeof(rt_node_leaf_128),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
+ },
+ [RT_NODE_KIND_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating over the radix tree returns each key-value pair in ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_NODE_KIND_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first element in 'node' whose chunk equals 'chunk'.
+ * Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunks array of the given node.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in 'node' whose chunk equals 'chunk'.
+ * Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunks array of the given node.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_values, src_values, sizeof(uint64) * count);
+}
+
+/* Functions to manipulate inner and leaf node-128 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_128 *) node)->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+static void
+node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Return an unused slot in node-128 */
+static int
+node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Delete the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key value that can be stored under a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ rt_node *node;
+
+ node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
+ shift > 0);
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = node;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+ newnode->kind = kind;
+ newnode->shift = shift;
+ newnode->chunk = chunk;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+
+ memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+ }
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[kind]++;
+#endif
+
+ return newnode;
+}
+
+/*
+ * Create a new node of 'new_kind' with the same shift, chunk, and
+ * count as 'node'.
+ */
+static rt_node *
+rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode->count = node->count;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ tree->root = NULL;
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[node->kind]--;
+ Assert(tree->cnt[node->kind] >= 0);
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
+ shift, 0, true);
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+
+ newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
+ RT_GET_KEY_CHUNK(key, node->shift),
+ newshift > 0);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is returned in *child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_128_get_child(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is returned in *value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_128_get_value(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children,
+ n4->base.n.count);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_inner_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_128_update(n128, chunk, child);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_inner_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_128_get_child(n128, i));
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and child pointer are inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values,
+ n4->base.n.count);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_leaf_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_128_update(n128, chunk, value);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_leaf_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_128_get_value(n128, i));
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_128_insert(n128, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to 'value'
+ * and return true. Returns false if entry doesn't yet exist.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent = tree->root;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set to *value_p, which
+ * therefore must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ /*
+ * If we eventually deleted the root node while recursively deleting empty
+ * nodes, we make the tree empty.
+ */
+ if (level == 0)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is being
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from the level=1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_128_get_child(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to value_p, otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_128_get_value(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) node,
+ n128->slot_idxs[i]));
+ else
+ Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) node,
+ n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0],
+ tree->cnt[1],
+ tree->cnt[2],
+ tree->cnt[3])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[128] = {0};
+
+ fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *b128 = (rt_node_base_128 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b128->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_128_get_value(n128, i));
+ }
+ else
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_128_get_child(n128, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = %lu\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 7b3f292965..e587cabe13 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -26,6 +26,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index c2e5f5ffd5..c86f6bdcb0 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -20,6 +20,7 @@ subdir('test_oat_hooks')
subdir('test_parser')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..cb3596755d
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,504 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (if you do
+ * that, you might want to increase the number of values used in each
+ * of the tests to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int rt_node_max_entries[] = {
+ 4, /* RT_NODE_KIND_4 */
+ 16, /* RT_NODE_KIND_16 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ uint64 dummy;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(rt_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (i == (rt_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : rt_node_max_entries[j - 1],
+ rt_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
Attachments:
v9-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From c8918d78d679fabe40a2855ba4d9ea0d1dbb5445 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v9 1/6] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
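
For reference, here is a hypothetical sketch of how vector8_eq() and the new
vector8_highbit_mask() can be combined to search a small chunk array, which is
the kind of use these helpers are aimed at. This is not code from the patch
set; it assumes SIMD support is available, that 'count' fits within a single
vector register, and that chunks[] is padded out to sizeof(Vector8) bytes:

#include "port/pg_bitutils.h"
#include "port/simd.h"

/*
 * Hypothetical example: return the index of key_chunk among the first
 * 'count' entries of chunks[], or -1 if it is not present. A node-32
 * search would run this over two 16-byte vectors on SSE2/NEON.
 */
static inline int
chunk_search_eq_simd(const uint8 *chunks, int count, uint8 key_chunk)
{
	Vector8		spread = vector8_broadcast(key_chunk);
	Vector8		haystack;
	uint32		bitfield;

	Assert(count <= sizeof(Vector8));

	vector8_load(&haystack, chunks);

	/* one bit per byte position that compared equal */
	bitfield = vector8_highbit_mask(vector8_eq(haystack, spread));

	/* ignore any matches beyond the valid entries */
	bitfield &= (((uint32) 1) << count) - 1;

	return bitfield ? pg_rightmost_one_pos32(bitfield) : -1;
}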
On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
0004 patch is a new patch supporting a pointer tagging of the node
kind. Also, it introduces rt_node_ptr we discussed so that internal
functions use it rather than having two arguments for encoded and
decoded pointers. With this intermediate patch, the DSA support patch
became more readable and understandable. Probably we can make it
smaller further if we move the change of separating the control object
from radix_tree to the main patch (0002). The patch still needs to be
polished but I'd like to check if this idea is worthwhile. If we agree
on this direction, this patch will be merged into the main radix tree
implementation patch.
Thanks for the new patch set. I've taken a very brief look at 0004 and I
think the broad outlines are okay. As you say it needs polish, but before
going further, I'd like to do some experiments of my own as I mentioned
earlier:
- See how much performance we actually gain from tagging the node kind.
- Try additional size classes while keeping the node kinds to only four.
- Optimize node128 insert.
- Try templating out the differences between local and shared memory. With
local memory, the node-pointer struct would be a union, for example.
Templating would also reduce branches and re-simplify some internal APIs,
but it's likely that would also make the TID store and/or vacuum more
complex, because at least some external functions would be duplicated.
I'll set the patch to "waiting on author", but in this case the author is
me.
--
John Naylor
EDB: http://www.enterprisedb.com
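
For reference, the pointer-tagging idea under discussion can be sketched
roughly like this, under the assumption that nodes are allocated with at least
8-byte alignment and that the kind constants fit in the low three bits of a
pointer. The names below are illustrative only and are not the rt_node_ptr
from the 0004 patch:

/* Illustrative only; not the 0004 patch's actual representation. */
typedef uintptr_t rt_tagged_ptr;

#define RT_PTR_KIND_MASK	((uintptr_t) 0x7)

static inline rt_tagged_ptr
rt_tag_node(rt_node *node, uint8 kind)
{
	/* slab-allocated nodes are assumed to be at least 8-byte aligned */
	Assert(((uintptr_t) node & RT_PTR_KIND_MASK) == 0);
	return (uintptr_t) node | kind;
}

static inline uint8
rt_tagged_kind(rt_tagged_ptr ptr)
{
	return (uint8) (ptr & RT_PTR_KIND_MASK);
}

static inline rt_node *
rt_tagged_node(rt_tagged_ptr ptr)
{
	return (rt_node *) (ptr & ~RT_PTR_KIND_MASK);
}

The attraction is that the switch on the node kind during descent can be
driven by the parent's child pointer alone, before the child's header has been
loaded from memory; how much that actually buys is what the first experiment
above is meant to measure.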
On Mon, Nov 14, 2022 at 10:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
0004 patch is a new patch supporting a pointer tagging of the node
kind. Also, it introduces rt_node_ptr we discussed so that internal
functions use it rather than having two arguments for encoded and
decoded pointers. With this intermediate patch, the DSA support patch
became more readable and understandable. Probably we can make it
smaller further if we move the change of separating the control object
from radix_tree to the main patch (0002). The patch still needs to be
polished but I'd like to check if this idea is worthwhile. If we agree
on this direction, this patch will be merged into the main radix tree
implementation patch.
Thanks for the new patch set. I've taken a very brief look at 0004 and I think the broad outlines are okay. As you say it needs polish, but before going further, I'd like to do some experiments of my own as I mentioned earlier:
- See how much performance we actually gain from tagging the node kind.
- Try additional size classes while keeping the node kinds to only four.
- Optimize node128 insert.
- Try templating out the differences between local and shared memory. With local memory, the node-pointer struct would be a union, for example. Templating would also reduce branches and re-simplify some internal APIs, but it's likely that would also make the TID store and/or vacuum more complex, because at least some external functions would be duplicated.
Thanks! Please let me know if there is something I can help with.
In the meanwhile, I'd like to make some progress on the vacuum
integration and improving the test coverages.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Thanks! Please let me know if there is something I can help with.
I didn't get very far because the tests fail on 0004 in rt_verify_node:
TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Nov 16, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com>
wrote:
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Thanks! Please let me know if there is something I can help with.
I didn't get very far because the tests fail on 0004 in rt_verify_node:
TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
Actually I do want to offer some general advice. Upthread I recommended a
purely refactoring patch that added the node-pointer struct but did nothing
else, so that the DSA changes would be smaller. 0004 attempted pointer
tagging in the same commit, which makes it no longer a purely refactoring
patch, so that 1) makes it harder to tell what part caused the bug and 2)
obscures what is necessary for DSA pointers and what was additionally
necessary for pointer tagging. Shared memory support is a prerequisite for
a shippable feature, but pointer tagging is (hopefully) a performance
optimization. Let's keep them separate.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Nov 16, 2022 at 1:46 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thanks! Please let me know if there is something I can help with.
I didn't get very far because the tests fail on 0004 in rt_verify_node:
TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
Which tests do you use to get this assertion failure? I've confirmed
there is a bug in 0005 patch but without it, "make check-world"
passed.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Nov 16, 2022 at 2:17 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Nov 16, 2022 at 11:46 AM John Naylor <john.naylor@enterprisedb.com> wrote:
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thanks! Please let me know if there is something I can help with.
I didn't get very far because the tests fail on 0004 in rt_verify_node:
TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
Actually I do want to offer some general advice. Upthread I recommended a purely refactoring patch that added the node-pointer struct but did nothing else, so that the DSA changes would be smaller. 0004 attempted pointer tagging in the same commit, which makes it no longer a purely refactoring patch, so that 1) makes it harder to tell what part caused the bug and 2) obscures what is necessary for DSA pointers and what was additionally necessary for pointer tagging. Shared memory support is a prerequisite for a shippable feature, but pointer tagging is (hopefully) a performance optimization. Let's keep them separate.
Totally agreed. I'll separate them in the next version patch. Thank
you for your advice.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Nov 16, 2022 at 1:46 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Thanks! Please let me know if there is something I can help with.
I didn't get very far because the tests fail on 0004 in rt_verify_node:
TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
Which tests do you use to get this assertion failure? I've confirmed
there is a bug in 0005 patch but without it, "make check-world"
passed.
Hmm, I started over and rebuilt and it didn't reproduce. Not sure what
happened, sorry for the noise.
I'm attaching a test I wrote to stress test branch prediction in search,
and while trying it out I found two possible issues.
It's based on the random int load test, but tests search speed. Run like
this:
select * from bench_search_random_nodes(10 * 1000 * 1000)
It also takes some care to include all the different node kinds,
restricting the possible keys by AND-ing with a filter. Here's a simple
demo:
filter = ((uint64)1<<40)-1;
LOG: num_keys = 9999967, height = 4, n4 = 17513814, n32 = 6320, n128 =
62663, n256 = 3130
Just using random integers leads to >99% using the smallest node. I wanted
to get close to having the same number of each, but that's difficult while
still using random inputs. I ended up using
filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
which gives
LOG: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 =
182670, n256 = 1024
Which seems okay for the task. One puzzling thing I found while trying
various filters is that sometimes the reported tree height would change.
For example:
filter = (((uint64) 1<<32) | (0xFF<<24));
LOG: num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 =
62632, n256 = 3161
1) Any idea why the tree height would be reported as 7 here? I didn't
expect that.
2) It seems that 0004 actually causes a significant slowdown in this test
(as in the attached, using the second filter above and with turboboost
disabled):
v9 0003: 2062 2051 2050
v9 0004: 2346 2316 2321
That means my idea for the pointer struct might have some problems, at
least as currently implemented. Maybe in the course of separating out and
polishing that piece, an inefficiency will fall out. Or, it might be
another reason to template local and shared separately. Not sure yet. I
also haven't tried to adjust this test for the shared memory case.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
add-random-node-search-test.patch.txt
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 0874201d7e..e0205b364e 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -43,6 +43,14 @@ returns record
as 'MODULE_PATHNAME'
LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+create function bench_search_random_nodes(
+cnt int8,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
create function bench_fixed_height_search(
fanout int4,
OUT fanout int4,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 7abb237e96..a43fc61c2d 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -29,6 +29,7 @@ PG_FUNCTION_INFO_V1(bench_seq_search);
PG_FUNCTION_INFO_V1(bench_shuffle_search);
PG_FUNCTION_INFO_V1(bench_load_random_int);
PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
static uint64
tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -347,6 +348,77 @@ bench_load_random_int(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ const uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
Datum
bench_fixed_height_search(PG_FUNCTION_ARGS)
{
On Wed, Nov 16, 2022 at 4:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Nov 16, 2022 at 1:46 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thanks! Please let me know if there is something I can help with.
I didn't get very far because the tests fail on 0004 in rt_verify_node:
TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
Which tests do you use to get this assertion failure? I've confirmed
there is a bug in 0005 patch but without it, "make check-world"
passed.
Hmm, I started over and rebuilt and it didn't reproduce. Not sure what happened, sorry for the noise.
Good to know. No problem.
I'm attaching a test I wrote to stress test branch prediction in search, and while trying it out I found two possible issues.
Thank you for testing!
It's based on the random int load test, but tests search speed. Run like this:
select * from bench_search_random_nodes(10 * 1000 * 1000)
It also takes some care to include all the different node kinds, restricting the possible keys by AND-ing with a filter. Here's a simple demo:
filter = ((uint64)1<<40)-1;
LOG: num_keys = 9999967, height = 4, n4 = 17513814, n32 = 6320, n128 = 62663, n256 = 3130
Just using random integers leads to >99% using the smallest node. I wanted to get close to having the same number of each, but that's difficult while still using random inputs. I ended up using
filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
which gives
LOG: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024
Which seems okay for the task. One puzzling thing I found while trying various filters is that sometimes the reported tree height would change. For example:
filter = (((uint64) 1<<32) | (0xFF<<24));
LOG: num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161
1) Any idea why the tree height would be reported as 7 here? I didn't expect that.
In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
It seems the filter should be (((uint64) 1<<32) | ((uint64)
0xFF<<24)).
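To illustrate, here is a tiny standalone example (just an illustration I'm adding, not part of any patch; it assumes the usual 32-bit int):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    /* 0xFF is a signed int, so 0xFF << 24 sets the sign bit; converting
     * that negative int to a 64-bit unsigned type sign-extends the upper
     * 32 bits, which is why the tree ends up much taller than expected. */
    uint64_t    bad = (((uint64_t) 1 << 32) | (0xFF << 24));
    uint64_t    good = (((uint64_t) 1 << 32) | ((uint64_t) 0xFF << 24));

    printf("bad  = %016llx\n", (unsigned long long) bad);  /* ffffffffff000000 */
    printf("good = %016llx\n", (unsigned long long) good); /* 00000001ff000000 */
    return 0;
}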
2) It seems that 0004 actually causes a significant slowdown in this test (as in the attached, using the second filter above and with turboboost disabled):
v9 0003: 2062 2051 2050
v9 0004: 2346 2316 2321
That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.
I'll also run the test on my environment and do the investigation tomorrow.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Sep 28, 2022 at 1:18 PM I wrote:
Along those lines, one thing I've been thinking about is the number of
size classes. There is a tradeoff between memory efficiency and number of
branches when searching/inserting. My current thinking is there is too much
coupling between size class and data type. Each size class currently uses a
different data type and a different algorithm to search and set it, which
in turn requires another branch. We've found that a larger number of size
classes leads to poor branch prediction [1] and (I imagine) code density.
I'm thinking we can use "flexible array members" for the values/pointers,
and keep the rest of the control data in the struct the same. That way, we
never have more than 4 actual "kinds" to code and branch on. As a bonus,
when migrating a node to a larger size class of the same kind, we can
simply repalloc() to the next size.
While the most important challenge right now is how to best represent and
organize the shared memory case, I wanted to get the above idea working and
out of the way, to be saved for a future time. I've attached a rough
implementation (applies on top of v9 0003) that splits node32 into 2 size
classes. They both share the exact same base data type and hence the same
search/set code, so the number of "kind"s is still four, but here there are
five "size classes", so a new case in the "unlikely" node-growing path. The
smaller instance of node32 is a "node15", because that's currently 160
bytes, corresponding to one of the DSA size classes. This idea can be
applied to any other node except the max size, as we see fit. (Adding a
singleton size class would bring it back in line with the prototype, at
least as far as memory consumption.)
One issue with this patch: The "fanout" member is a uint8, so it can't hold
256 for the largest node kind. That's not an issue in practice, since we
never need to grow it, and we only compare that value with the count in an
Assert(), so I just set it to zero. That does break an invariant, so it's
not great. We could use 2 bytes to be strictly correct in all cases, but
that limits what we can do with the smallest node kind.
In the course of working on this, I encountered a pain point. Since it's
impossible to repalloc in slab, we have to do alloc/copy/free ourselves.
That's fine, but the current coding makes too many assumptions about the
use cases: rt_alloc_node and rt_copy_node are too entangled with each other
and do too much work unrelated to what the names imply. I seem to remember
an earlier version had something like rt_node_copy_common that did
only...copying. That was much easier to reason about. In 0002 I resorted to
doing my own allocation to show what I really want to do, because the new
use case doesn't need zeroing and setting values. It only needs
to...allocate (and increase the stats counter if built that way).
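For reference, the narrow helper I have in mind would be roughly the following sketch (hypothetical name and shape, not taken from any posted patch):

static inline void
rt_node_copy_common(rt_node *dst, const rt_node *src)
{
    /* copy only the header fields that carry over to the new node;
     * allocation, kind/fanout, and the stats counter stay with the caller */
    dst->count = src->count;
    dst->shift = src->shift;
    dst->chunk = src->chunk;
}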
Future optimization work while I'm thinking of it: rt_alloc_node should be
always-inlined and the memset done separately (i.e. not *AllocZero). That
way the compiler should be able to generate more efficient zeroing code for
smaller nodes. I'll test the numbers on this sometime in the future.
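Roughly like this sketch (the name is made up and the real function takes more parameters; this is only meant to show the shape of the idea):

static pg_attribute_always_inline rt_node *
rt_alloc_node_raw(radix_tree *tree, rt_size_class size_class, bool inner)
{
    MemoryContext slab = inner ? tree->inner_slabs[size_class]
                               : tree->leaf_slabs[size_class];
    Size        size = inner ? rt_size_class_info[size_class].inner_size
                             : rt_size_class_info[size_class].leaf_size;

    /* no zeroing here -- the caller memsets with a compile-time-constant
     * size, so the compiler can specialize it per node size */
    return (rt_node *) MemoryContextAlloc(slab, size);
}

The call site would then do the memset itself with a sizeof() expression, which is a compile-time constant for each node struct.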
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v901-0002-Make-node32-variable-sized.patch.txt
From 6fcc970ae7e31f44fa6b6aface983cadb023cc50 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 17 Nov 2022 16:10:44 +0700
Subject: [PATCH v901 2/2] Make node32 variable sized
Add a size class for 15 elements, which corresponds to 160 bytes,
an allocation size used by DSA. When a 16th element is to be
inserted, allocate a larger area and memcpy the entire old node
to it.
NB: Zeroing the new area is only necessary if it's for an
inner node128, since insert logic must check for null child
pointers.
This technique allows us to limit the node kinds to 4, which
1. limits the number of cases in switch statements
2. allows a possible future optimization to encode the node kind
in a pointer tag
---
src/backend/lib/radixtree.c | 141 +++++++++++++++++++++++++++---------
1 file changed, 108 insertions(+), 33 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bef1a438ab..f368e750d5 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -130,6 +130,7 @@ typedef enum
typedef enum rt_size_class
{
RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
RT_CLASS_32_FULL,
RT_CLASS_128_FULL,
RT_CLASS_256
@@ -147,6 +148,8 @@ typedef struct rt_node
uint16 count;
/* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
uint8 fanout;
/*
@@ -166,6 +169,8 @@ typedef struct rt_node
((node)->base.n.count < (node)->base.n.fanout)
/* Base type of each node kinds for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+class for variable-sized node kinds */
typedef struct rt_node_base_4
{
rt_node n;
@@ -217,40 +222,40 @@ typedef struct rt_node_inner_4
{
rt_node_base_4 base;
- /* 4 children, for key chunks */
- rt_node *children[4];
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
{
rt_node_base_4 base;
- /* 4 values, for key chunks */
- uint64 values[4];
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_4;
typedef struct rt_node_inner_32
{
rt_node_base_32 base;
- /* 32 children, for key chunks */
- rt_node *children[32];
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
{
rt_node_base_32 base;
- /* 32 values, for key chunks */
- uint64 values[32];
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_32;
typedef struct rt_node_inner_128
{
rt_node_base_128 base;
- /* Slots for 128 children */
- rt_node *children[128];
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_128;
typedef struct rt_node_leaf_128
@@ -260,8 +265,8 @@ typedef struct rt_node_leaf_128
/* isset is a bitmap to track which slot is in use */
uint8 isset[RT_NODE_NSLOTS_BITS(128)];
- /* Slots for 128 values */
- uint64 values[128];
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_128;
/*
@@ -307,32 +312,40 @@ typedef struct rt_size_class_elem
* from the block.
*/
#define NODE_SLAB_BLOCK_SIZE(size) \
- Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
[RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
- .inner_size = sizeof(rt_node_inner_4),
- .leaf_size = sizeof(rt_node_leaf_4),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
- .inner_size = sizeof(rt_node_inner_32),
- .leaf_size = sizeof(rt_node_leaf_32),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
},
[RT_CLASS_128_FULL] = {
.name = "radix tree node 128",
.fanout = 128,
- .inner_size = sizeof(rt_node_inner_128),
- .leaf_size = sizeof(rt_node_leaf_128),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
+ .inner_size = sizeof(rt_node_inner_128) + 128 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_128) + 128 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128) + 128 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128) + 128 * sizeof(uint64)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
@@ -922,7 +935,6 @@ rt_free_node(radix_tree *tree, rt_node *node)
#ifdef RT_DEBUG
/* update the statistics */
- // FIXME
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
if (node->fanout == rt_size_class_info[i].fanout)
@@ -1240,7 +1252,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* grow node from 4 to 32 */
new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32, RT_CLASS_32_FULL);
+ RT_NODE_KIND_32, RT_CLASS_32_PARTIAL);
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children,
n4->base.n.count);
@@ -1282,6 +1294,37 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
{
+ Assert(parent != NULL);
+
+ if (n32->base.n.fanout == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+
+ /* no need to zero the new memory */
+ rt_node_inner_32 *new32 =
+ (rt_node_inner_32 *) MemoryContextAlloc(tree->inner_slabs[RT_CLASS_32_FULL],
+ rt_size_class_info[RT_CLASS_32_FULL].inner_size);
+
+// FIXME the API for rt_alloc_node and rt_node_copy are too entangled
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[RT_CLASS_32_FULL]++;
+#endif
+ /* copy the entire old node -- the new node is only different in having
+ additional slots so we only have to change the fanout */
+ memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
+ new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+ goto retry_insert_inner_32;
+ }
+ else
+ {
rt_node_inner_128 *new128;
/* grow node from 32 to 128 */
@@ -1290,13 +1333,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
for (int i = 0; i < n32->base.n.count; i++)
node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
- Assert(parent != NULL);
rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
key);
node = (rt_node *) new128;
+ }
}
else
{
+retry_insert_inner_32:
int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
int16 count = n32->base.n.count;
@@ -1409,12 +1453,10 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* grow node from 4 to 32 */
new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32, RT_CLASS_32_FULL);
+ RT_NODE_KIND_32, RT_CLASS_32_PARTIAL);
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values,
n4->base.n.count);
-
- Assert(parent != NULL);
rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
key);
node = (rt_node *) new32;
@@ -1451,6 +1493,37 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
{
+ Assert(parent != NULL);
+
+ if (n32->base.n.fanout == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+
+ /* no need to zero the new memory */
+ rt_node_leaf_32 *new32 =
+ (rt_node_leaf_32 *) MemoryContextAlloc(tree->leaf_slabs[RT_CLASS_32_FULL],
+ rt_size_class_info[RT_CLASS_32_FULL].leaf_size);
+
+// FIXME the API for rt_alloc_node and rt_node_copy are too entangled
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[RT_CLASS_32_FULL]++;
+#endif
+ /* copy the entire old node -- the new node is only different in having
+ additional slots so we only have to change the fanout */
+ memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size);
+ new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
rt_node_leaf_128 *new128;
/* grow node from 32 to 128 */
@@ -1459,13 +1532,14 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
- Assert(parent != NULL);
rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
key);
node = (rt_node *) new128;
+ }
}
else
{
+retry_insert_leaf_32:
int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
int count = n32->base.n.count;
@@ -2189,10 +2263,11 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
- ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
+ ereport(NOTICE, (errmsg("num_keys = %lu, height = %u, n4 = %u, n15 = %u, n32 = %u, n128 = %u, n256 = %u",
tree->num_keys,
tree->root->shift / RT_NODE_SPAN,
tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
tree->cnt[RT_CLASS_32_FULL],
tree->cnt[RT_CLASS_128_FULL],
tree->cnt[RT_CLASS_256])));
--
2.38.1
v901-0001-Preparatory-refactoring-for-decoupling-kind-fro.patch.txt
From 15e16df13912d265c3b1eda858456de6fe595c33 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 17 Nov 2022 12:10:31 +0700
Subject: [PATCH v901 1/2] Preparatory refactoring for decoupling kind from
size class
Rename the current kind info array to refer to size classes, but
keep all the contents the same.
Add a fanout member to all nodes which stores the max capacity of
the node. This is currently set with the same hardcoded value as
in the kind info array.
In passing, remove outdated reference to node16 in the regression
test.
---
src/backend/lib/radixtree.c | 147 +++++++++++-------
.../modules/test_radixtree/test_radixtree.c | 1 -
2 files changed, 87 insertions(+), 61 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bd58b2bfad..bef1a438ab 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -127,6 +127,16 @@ typedef enum
#define RT_NODE_KIND_256 0x03
#define RT_NODE_KIND_COUNT 4
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_FULL,
+ RT_CLASS_128_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
/* Common type for all nodes types */
typedef struct rt_node
{
@@ -136,6 +146,9 @@ typedef struct rt_node
*/
uint16 count;
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ uint8 fanout;
+
/*
* Shift indicates which part of the key space is represented by this
* node. That is, the key is shifted by 'shift' and the lowest
@@ -144,13 +157,13 @@ typedef struct rt_node
uint8 shift;
uint8 chunk;
- /* Size kind of the node */
+ /* Node kind, one per search/set algorithm */
uint8 kind;
} rt_node;
#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
-#define NODE_HAS_FREE_SLOT(n) \
- (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+#define NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
/* Base type of each node kinds for leaf and inner nodes */
typedef struct rt_node_base_4
@@ -190,7 +203,7 @@ typedef struct rt_node_base256
/*
* Inner and leaf nodes.
*
- * There are separate from inner node size classes for two main reasons:
+ * These are separate for two main reasons:
*
* 1) the value type might be different than something fitting into a pointer
* width type
@@ -274,8 +287,8 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
-/* Information of each size kinds */
-typedef struct rt_node_kind_info_elem
+/* Information for each size class */
+typedef struct rt_size_class_elem
{
const char *name;
int fanout;
@@ -287,7 +300,7 @@ typedef struct rt_node_kind_info_elem
/* slab block size */
Size inner_blocksize;
Size leaf_blocksize;
-} rt_node_kind_info_elem;
+} rt_size_class_elem;
/*
* Calculate the slab blocksize so that we can allocate at least 32 chunks
@@ -295,9 +308,9 @@ typedef struct rt_node_kind_info_elem
*/
#define NODE_SLAB_BLOCK_SIZE(size) \
Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
-static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
- [RT_NODE_KIND_4] = {
+ [RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
.inner_size = sizeof(rt_node_inner_4),
@@ -305,7 +318,7 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
},
- [RT_NODE_KIND_32] = {
+ [RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(rt_node_inner_32),
@@ -313,7 +326,7 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
},
- [RT_NODE_KIND_128] = {
+ [RT_CLASS_128_FULL] = {
.name = "radix tree node 128",
.fanout = 128,
.inner_size = sizeof(rt_node_inner_128),
@@ -321,9 +334,11 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
},
- [RT_NODE_KIND_256] = {
+ [RT_CLASS_256] = {
.name = "radix tree node 256",
- .fanout = 256,
+ /* technically it's 256, but we can't store that in a uint8,
+ and this is the max size class so it will never grow */
+ .fanout = 0,
.inner_size = sizeof(rt_node_inner_256),
.leaf_size = sizeof(rt_node_leaf_256),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
@@ -372,17 +387,17 @@ struct radix_tree
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
- MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
/* statistics */
#ifdef RT_DEBUG
- int32 cnt[RT_NODE_KIND_COUNT];
+ int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+static rt_node *rt_alloc_node(radix_tree *tree, int kind, rt_size_class size_class, uint8 shift, uint8 chunk,
bool inner);
static void rt_free_node(radix_tree *tree, rt_node *node);
static void rt_extend(radix_tree *tree, uint64 key);
@@ -584,7 +599,7 @@ chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
uint8 *dst_chunks, rt_node **dst_children, int count)
{
/* For better code generation */
- if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ if (count > rt_size_class_info[RT_CLASS_4_FULL].fanout)
pg_unreachable();
memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
@@ -596,7 +611,7 @@ chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
uint8 *dst_chunks, uint64 *dst_values, int count)
{
/* For better code generation */
- if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ if (count > rt_size_class_info[RT_CLASS_4_FULL].fanout)
pg_unreachable();
memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
@@ -837,7 +852,7 @@ rt_new_root(radix_tree *tree, uint64 key)
int shift = key_get_shift(key);
rt_node *node;
- node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
+ node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL, shift, 0,
shift > 0);
tree->max_val = shift_get_max_val(shift);
tree->root = node;
@@ -847,18 +862,19 @@ rt_new_root(radix_tree *tree, uint64 key)
* Allocate a new node with the given node kind.
*/
static rt_node *
-rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
+rt_alloc_node(radix_tree *tree, int kind, rt_size_class size_class, uint8 shift, uint8 chunk, bool inner)
{
rt_node *newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
- rt_node_kind_info[kind].inner_size);
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
else
- newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
- rt_node_kind_info[kind].leaf_size);
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
newnode->kind = kind;
+ newnode->fanout = rt_size_class_info[size_class].fanout;
newnode->shift = shift;
newnode->chunk = chunk;
@@ -872,7 +888,7 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[kind]++;
+ tree->cnt[size_class]++;
#endif
return newnode;
@@ -883,11 +899,11 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
* count of 'node'.
*/
static rt_node *
-rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
+rt_copy_node(radix_tree *tree, rt_node *node, int new_kind, rt_size_class new_size_class)
{
rt_node *newnode;
- newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
+ newnode = rt_alloc_node(tree, new_kind, new_size_class, node->shift, node->chunk,
node->shift > 0);
newnode->count = node->count;
@@ -898,14 +914,22 @@ rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
static void
rt_free_node(radix_tree *tree, rt_node *node)
{
+ int i;
+
/* If we're deleting the root node, make the tree empty */
if (tree->root == node)
tree->root = NULL;
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[node->kind]--;
- Assert(tree->cnt[node->kind] >= 0);
+ // FIXME
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
#endif
pfree(node);
@@ -954,7 +978,7 @@ rt_extend(radix_tree *tree, uint64 key)
{
rt_node_inner_4 *node;
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL,
shift, 0, true);
node->base.n.count = 1;
node->base.chunks[0] = 0;
@@ -984,7 +1008,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
rt_node *newchild;
int newshift = shift - RT_NODE_SPAN;
- newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
+ newchild = rt_alloc_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL, newshift,
RT_GET_KEY_CHUNK(key, node->shift),
newshift > 0);
rt_node_insert_inner(tree, parent, node, key, newchild);
@@ -1216,7 +1240,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* grow node from 4 to 32 */
new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ RT_NODE_KIND_32, RT_CLASS_32_FULL);
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children,
n4->base.n.count);
@@ -1262,7 +1286,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* grow node from 32 to 128 */
new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ RT_NODE_KIND_128, RT_CLASS_128_FULL);
for (int i = 0; i < n32->base.n.count; i++)
node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
@@ -1305,7 +1329,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* grow node from 128 to 256 */
new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ RT_NODE_KIND_256, RT_CLASS_256);
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1332,7 +1356,8 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(n256->base.n.fanout == 0);
+ Assert(chunk_exists || ((rt_node *) n256)->count < RT_NODE_MAX_SLOTS);
node_inner_256_set(n256, chunk, child);
break;
@@ -1384,7 +1409,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* grow node from 4 to 32 */
new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ RT_NODE_KIND_32, RT_CLASS_32_FULL);
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values,
n4->base.n.count);
@@ -1430,7 +1455,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* grow node from 32 to 128 */
new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ RT_NODE_KIND_128, RT_CLASS_128_FULL);
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
@@ -1473,7 +1498,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* grow node from 128 to 256 */
new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ RT_NODE_KIND_256, RT_CLASS_256);
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1500,7 +1525,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(((rt_node *) n256)->fanout == 0);
+ Assert(chunk_exists || ((rt_node *) n256)->count < 256);
node_leaf_256_set(n256, chunk, value);
break;
@@ -1538,16 +1564,16 @@ rt_create(MemoryContext ctx)
tree->num_keys = 0;
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].inner_size);
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].leaf_blocksize,
- rt_node_kind_info[i].leaf_size);
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
tree->cnt[i] = 0;
#endif
@@ -1564,7 +1590,7 @@ rt_create(MemoryContext ctx)
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
MemoryContextDelete(tree->inner_slabs[i]);
MemoryContextDelete(tree->leaf_slabs[i]);
@@ -2076,7 +2102,7 @@ rt_memory_usage(radix_tree *tree)
{
Size total = sizeof(radix_tree);
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
@@ -2166,10 +2192,10 @@ rt_stats(radix_tree *tree)
ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
tree->num_keys,
tree->root->shift / RT_NODE_SPAN,
- tree->cnt[0],
- tree->cnt[1],
- tree->cnt[2],
- tree->cnt[3])));
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_128_FULL],
+ tree->cnt[RT_CLASS_256])));
}
static void
@@ -2177,11 +2203,12 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
char space[128] = {0};
- fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
(node->kind == RT_NODE_KIND_4) ? 4 :
(node->kind == RT_NODE_KIND_32) ? 32 :
(node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
node->count, node->shift, node->chunk);
if (level > 0)
@@ -2384,13 +2411,13 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_size,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].leaf_size,
- rt_node_kind_info[i].leaf_blocksize);
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
fprintf(stderr, "max_val = %lu\n", tree->max_val);
if (!tree->root)
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index cb3596755d..de1cd6cd70 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -40,7 +40,6 @@ static const bool rt_test_stats = false;
/* The maximum number of entries each node type can have */
static int rt_node_max_entries[] = {
4, /* RT_NODE_KIND_4 */
- 16, /* RT_NODE_KIND_16 */
32, /* RT_NODE_KIND_32 */
128, /* RT_NODE_KIND_128 */
256 /* RT_NODE_KIND_256 */
--
2.38.1
On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Nov 16, 2022 at 4:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Nov 16, 2022 at 1:46 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Thanks! Please let me know if there is something I can help with.
I didn't get very far because the tests fail on 0004 in rt_verify_node:
TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
Which tests do you use to get this assertion failure? I've confirmed
there is a bug in 0005 patch but without it, "make check-world"
passed.
Hmm, I started over and rebuilt and it didn't reproduce. Not sure what happened, sorry for the noise.
Good to know. No problem.
I'm attaching a test I wrote to stress test branch prediction in search, and while trying it out I found two possible issues.
Thank you for testing!
It's based on the random int load test, but tests search speed. Run like this:
select * from bench_search_random_nodes(10 * 1000 * 1000)
It also takes some care to include all the different node kinds, restricting the possible keys by AND-ing with a filter. Here's a simple demo:
filter = ((uint64)1<<40)-1;
LOG: num_keys = 9999967, height = 4, n4 = 17513814, n32 = 6320, n128 = 62663, n256 = 3130
Just using random integers leads to >99% using the smallest node. I wanted to get close to having the same number of each, but that's difficult while still using random inputs. I ended up using
filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)
which gives
LOG: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024
Which seems okay for the task. One puzzling thing I found while trying various filters is that sometimes the reported tree height would change. For example:
filter = (((uint64) 1<<32) | (0xFF<<24));
LOG: num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161
1) Any idea why the tree height would be reported as 7 here? I didn't expect that.
In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
It seems the filter should be (((uint64) 1<<32) | ((uint64)
0xFF<<24)).
2) It seems that 0004 actually causes a significant slowdown in this test (as in the attached, using the second filter above and with turboboost disabled):
v9 0003: 2062 2051 2050
v9 0004: 2346 2316 2321
That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.
I'll also run the test on my environment and do the investigation tomorrow.
FYI I've not tested the patch you shared today, but here are the
benchmark results I got with the v9 patch in my environment (I used
the second filter). I split the 0004 patch into two patches: a pure
refactoring patch to introduce rt_node_ptr, and a patch to do
pointer tagging.
v9 0003 patch : 1113 1114 1114
introduce rt_node_ptr: 1127 1128 1128
pointer tagging : 1085 1087 1086 (equivalent to 0004 patch)
In my environment, rt_node_ptr seemed to add some overhead, but
pointer tagging had performance benefits. I'm not sure why the
results differ from yours. The radix tree stats show the same
numbers as in your tests.
=# select * from bench_search_random_nodes(10 * 1000 * 1000);
2022-11-18 22:18:21.608 JST [3913544] LOG: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
FYI I've not tested the patch you shared today, but here are the
benchmark results I got with the v9 patch in my environment (I used
the second filter). I split the 0004 patch into two patches: a pure
refactoring patch to introduce rt_node_ptr, and a patch to do
pointer tagging.
v9 0003 patch : 1113 1114 1114
introduce rt_node_ptr: 1127 1128 1128
pointer tagging : 1085 1087 1086 (equivalent to 0004 patch)
In my environment, rt_node_ptr seemed to add some overhead, but
pointer tagging had performance benefits. I'm not sure why the
results differ from yours. The radix tree stats show the same
numbers as in your tests.
There is less than 2% difference from the median set of results, so it's
hard to distinguish from noise. I did a fresh rebuild and retested with the
same results: about a 15% slowdown in v9 0004. That's strange.
On Wed, Nov 16, 2022 at 10:24 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
filter = (((uint64) 1<<32) | (0xFF<<24));
LOG: num_keys = 9999944, height = 7, n4 = 47515559, n32 = 6209, n128 =
62632, n256 = 3161
1) Any idea why the tree height would be reported as 7 here? I didn't
expect that.
In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
It seems the filter should be (((uint64) 1<<32) | ((uint64)
0xFF<<24)).
Ugh, sign extension, brain fade on my part. Thanks, I'm glad there was a
straightforward explanation.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Nov 16, 2022 at 4:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
That means my idea for the pointer struct might have some problems,
at least as currently implemented. Maybe in the course of separating out
and polishing that piece, an inefficiency will fall out. Or, it might be
another reason to template local and shared separately. Not sure yet. I
also haven't tried to adjust this test for the shared memory case.
Digging a bit deeper, I see a flaw in my benchmark: Even though the total
distribution of node kinds is decently even, the pattern that the benchmark
sees is not terribly random:
3,343,352      branch-misses:u           #    0.85% of all branches
393,204,959    branches:u
Recall a previous benchmark [1] where the leaf node was about half node16
and half node32. Randomizing the leaf node between the two caused branch
misses to go from 1% to 2%, causing a noticeable slowdown. Maybe in this
new benchmark, each level has a skewed distribution of nodes, giving a
smart branch predictor something to work with. We will need a way to
efficiently generate keys that lead to a relatively unpredictable
distribution of node kinds, as seen by a searcher. Especially in the leaves
(or just above the leaves), since those are less likely to be cached.
I'll also run the test on my environment and do the investigation
tomorrow.
FYI I've not tested the patch you shared today, but here are the
benchmark results I got with the v9 patch in my environment (I used
the second filter). I split the 0004 patch into two patches: a pure
refactoring patch to introduce rt_node_ptr, and a patch to do
pointer tagging.
Would you be able to share the refactoring patch? And a fix for the failing
tests? I'm thinking I want to try the templating approach fairly soon.
[1]: /messages/by-id/CAFBsxsFEVckVzsBsfgGzGR4Yz=Jp=UxOtjYvTjOz6fOoLXtOig@mail.gmail.com
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Nov 18, 2022 at 2:48 PM I wrote:
One issue with this patch: The "fanout" member is a uint8, so it can't
hold 256 for the largest node kind. That's not an issue in practice, since
we never need to grow it, and we only compare that value with the count in
an Assert(), so I just set it to zero. That does break an invariant, so
it's not great. We could use 2 bytes to be strictly correct in all cases,
but that limits what we can do with the smallest node kind.
Thinking about this part, there's an easy resolution -- use a different
macro for fixed- and variable-sized node kinds to determine if there is a
free slot.
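Something along these lines, with invented names, just to show the shape:

/* variable-sized kinds consult their fanout member as now */
#define VAR_NODE_HAS_FREE_SLOT(node) \
    ((node)->base.n.count < (node)->base.n.fanout)

/* fixed-size kinds compare against a compile-time capacity, so node256
 * no longer needs the fanout = 0 workaround */
#define FIXED_NODE_HAS_FREE_SLOT(node, capacity) \
    ((node)->base.n.count < (capacity))

The node256 cases would then use e.g. FIXED_NODE_HAS_FREE_SLOT(n256, RT_NODE_MAX_SLOTS), and the fanout member could be left out of the fixed-size kinds entirely.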
Also, I wanted to share some results of adjusting the boundary between the
two smallest node kinds. In the hackish attached patch, I modified the
fixed height search benchmark to search a small (within L1 cache) tree
thousands of times. For the first set I modified node4's maximum fanout and
filled it up. For the second, I set node4's fanout to 1, which causes 2+ to
spill to node32 (actually the partially-filled node15 size class
as demoed earlier).
node4:
NOTICE: num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
2 | 16 | 16520 | 0 | 3
NOTICE: num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
3 | 81 | 16456 | 0 | 17
NOTICE: num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
4 | 256 | 16456 | 0 | 89
NOTICE: num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
5 | 625 | 16488 | 0 | 327
node32:
NOTICE: num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
2 | 16 | 16488 | 0 | 5
(1 row)
NOTICE: num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
3 | 81 | 16520 | 0 | 28
NOTICE: num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
4 | 256 | 16408 | 0 | 79
NOTICE: num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0,
n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
5 | 625 | 24616 | 0 | 199
In this test, node32 seems slightly faster than node4 with 4 elements, at
the cost of more memory.
Assuming the smallest node is fixed size (i.e. fanout/capacity member not
part of the common set, so only part of variable-sized nodes), 3 has a nice
property: no wasted padding space:
node4: 5 + 4+(7) + 4*8 = 48 bytes
node3: 5 + 3 + 3*8 = 32
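To make that concrete, a hypothetical node3 leaf might look like this (flattening the common header into the struct for illustration):

typedef struct rt_node_leaf_3
{
    uint16      count;          /* common header: 5 data bytes ... */
    uint8       shift;
    uint8       chunk;
    uint8       kind;
    uint8       chunks[3];      /* ... plus 3 chunks fill exactly 8 bytes */
    uint64      values[3];      /* 24 bytes; 32 bytes total, no padding */
} rt_node_leaf_3;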
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Nov 21, 2022 at 3:43 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Nov 16, 2022 at 4:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.
Digging a bit deeper, I see a flaw in my benchmark: Even though the total distribution of node kinds is decently even, the pattern that the benchmark sees is not terribly random:
3,343,352 branch-misses:u # 0.85% of all branches
393,204,959 branches:u
Recall a previous benchmark [1] where the leaf node was about half node16 and half node32. Randomizing the leaf node between the two caused branch misses to go from 1% to 2%, causing a noticeable slowdown. Maybe in this new benchmark, each level has a skewed distribution of nodes, giving a smart branch predictor something to work with. We will need a way to efficiently generate keys that lead to a relatively unpredictable distribution of node kinds, as seen by a searcher. Especially in the leaves (or just above the leaves), since those are less likely to be cached.
I'll also run the test on my environment and do the investigation tomorrow.
FYI I've not tested the patch you shared today, but here are the
benchmark results I got with the v9 patch in my environment (I used
the second filter). I split the 0004 patch into two patches: a pure
refactoring patch to introduce rt_node_ptr, and a patch to do
pointer tagging.
Would you be able to share the refactoring patch? And a fix for the failing tests? I'm thinking I want to try the templating approach fairly soon.
Sure. I've attached the v10 patches. 0004 is the pure refactoring
patch and 0005 patch introduces the pointer tagging.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v10-0003-tool-for-measuring-radix-tree-performance.patch
From 5cd4f1f8435d5367e09b8044c08e153ae05f2f19 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v10 3/7] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 64 +++
contrib/bench_radix_tree/bench_radix_tree.c | 541 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 661 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..e0205b364e
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,64 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..70ca989118
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,541 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ const uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
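To illustrate the load loop of bench_search() above: tid_to_key_off() maps consecutive TIDs to the same radix-tree key, so their bit positions are OR'ed into a single 64-bit value and flushed with one rt_set() call whenever the key changes. Below is a minimal standalone sketch of that accumulate-and-flush pattern; it is not part of the patch, and the 11-bit offset width and the set_cb() stand-in for rt_set() are assumptions made only for illustration.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 11   /* assumed width; anything >= bits for MaxHeapTuplesPerPage works */

/* stand-in for rt_set(rt, key, val) */
static void
set_cb(uint64_t key, uint64_t val)
{
    printf("key %llu -> bitmap 0x%llx\n",
           (unsigned long long) key, (unsigned long long) val);
}

/*
 * Mimic the load loop of bench_search(): encode each (block, offset) pair
 * into a 64-bit integer, use its upper bits as the tree key and its low
 * 6 bits as a position in a 64-bit value bitmap, and flush the bitmap
 * whenever the key changes.
 */
static void
load_tids(const uint32_t *blocks, const uint16_t *offsets, int ntids)
{
    uint64_t last_key = UINT64_MAX;
    uint64_t val = 0;

    for (int i = 0; i < ntids; i++)
    {
        uint64_t tid_i = ((uint64_t) blocks[i] << OFFSET_BITS) | offsets[i];
        uint64_t key = tid_i >> 6;
        uint32_t off = tid_i & ((1 << 6) - 1);

        if (last_key != UINT64_MAX && last_key != key)
        {
            set_cb(last_key, val);   /* flush the previous key's bitmap */
            val = 0;
        }
        last_key = key;
        val |= UINT64_C(1) << off;
    }
    if (last_key != UINT64_MAX)
        set_cb(last_key, val);
}

int
main(void)
{
    uint32_t blocks[]  = {0, 0, 0, 1};
    uint16_t offsets[] = {1, 2, 3, 1};

    load_tids(blocks, offsets, 4);
    return 0;
}

With 4 TIDs spanning two keys, this prints two lines, mirroring how one rt_set() covers up to 64 consecutive TID slots in the benchmark.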
Attachment: v10-0004-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch (application/octet-stream)
From 082277fda9061c8651b3cc4d2e70b763d508bb1a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 14 Nov 2022 11:44:17 +0900
Subject: [PATCH v10 4/7] Use rt_node_ptr to reference radix tree nodes.
---
src/backend/lib/radixtree.c | 652 ++++++++++++++++++++----------------
1 file changed, 369 insertions(+), 283 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 6159b73b75..67f4dc646e 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -126,6 +126,21 @@ typedef enum
#define RT_NODE_KIND_128 0x02
#define RT_NODE_KIND_256 0x03
#define RT_NODE_KIND_COUNT 4
+#define RT_POINTER_KIND_MASK 0x03
+
+/*
+ * rt_pointer is a tagged pointer to an rt_node. It is encoded from a
+ * C pointer (i.e., a local memory address) and the node kind. The node
+ * kind is stored in the lower 2 bits, which are always 0 in a local
+ * memory address. The pointer can be encoded and decoded with the
+ * rt_pointer_encode() and rt_pointer_decode() functions, respectively.
+ *
+ * For this reason, the inner nodes of the radix tree store rt_pointer
+ * rather than C pointers.
+ */
+typedef uintptr_t rt_pointer;
+#define InvalidRTPointer ((rt_pointer) 0)
+#define RTPointerIsValid(x) (((rt_pointer) (x)) != InvalidRTPointer)
/* Common type for all nodes types */
typedef struct rt_node
@@ -147,10 +162,7 @@ typedef struct rt_node
/* Size kind of the node */
uint8 kind;
} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
-#define NODE_HAS_FREE_SLOT(n) \
- (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+#define RT_NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
/* Base type of each node kinds for leaf and inner nodes */
typedef struct rt_node_base_4
@@ -205,7 +217,7 @@ typedef struct rt_node_inner_4
rt_node_base_4 base;
/* 4 children, for key chunks */
- rt_node *children[4];
+ rt_pointer children[4];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
@@ -221,7 +233,7 @@ typedef struct rt_node_inner_32
rt_node_base_32 base;
/* 32 children, for key chunks */
- rt_node *children[32];
+ rt_pointer children[32];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
@@ -237,7 +249,7 @@ typedef struct rt_node_inner_128
rt_node_base_128 base;
/* Slots for 128 children */
- rt_node *children[128];
+ rt_pointer children[128];
} rt_node_inner_128;
typedef struct rt_node_leaf_128
@@ -260,7 +272,7 @@ typedef struct rt_node_inner_256
rt_node_base_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
+ rt_pointer children[RT_NODE_MAX_SLOTS];
} rt_node_inner_256;
typedef struct rt_node_leaf_256
@@ -274,6 +286,30 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
+/*
+ * rt_node_ptr holds both the encoded (rt_pointer) and decoded (rt_node *) forms of a pointer to an rt_node.
+ */
+typedef struct rt_node_ptr
+{
+ rt_pointer encoded;
+ rt_node *decoded;
+} rt_node_ptr;
+#define InvalidRTNodePtr \
+ (rt_node_ptr) {.encoded = InvalidRTPointer, .decoded = NULL }
+#define RTNodePtrIsValid(n) \
+ (!rt_node_ptr_eq((rt_node_ptr *) &(n), &(InvalidRTNodePtr)))
+
+/* Macros for rt_node_ptr to access the fields of rt_node */
+#define NODE_RAW(n) (((rt_node_ptr) (n)).decoded)
+#define NODE_IS_LEAF(n) (NODE_RAW(n)->shift == 0)
+#define NODE_IS_EMPTY(n) (NODE_COUNT(n) == 0)
+#define NODE_KIND(n) (NODE_RAW(n)->kind)
+#define NODE_COUNT(n) (NODE_RAW(n)->count)
+#define NODE_SHIFT(n) (NODE_RAW(n)->shift)
+#define NODE_CHUNK(n) (NODE_RAW(n)->chunk)
+#define NODE_HAS_FREE_SLOT(n) \
+ (NODE_COUNT(n) < rt_node_kind_info[NODE_KIND(n)].fanout)
+
/* Information of each size kinds */
typedef struct rt_node_kind_info_elem
{
@@ -347,7 +383,7 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
*/
typedef struct rt_node_iter
{
- rt_node *node; /* current node being iterated */
+ rt_node_ptr node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
} rt_node_iter;
@@ -368,7 +404,7 @@ struct radix_tree
{
MemoryContext context;
- rt_node *root;
+ rt_pointer root;
uint64 max_val;
uint64 num_keys;
@@ -382,26 +418,56 @@ struct radix_tree
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
- bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
+static rt_node_ptr rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node_ptr node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+static inline bool rt_node_search_inner(rt_node_ptr node_ptr, uint64 key, rt_action action,
+ rt_pointer *child_p);
+static inline bool rt_node_search_leaf(rt_node_ptr node_ptr, uint64 key, rt_action action,
uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+static bool rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node_ptr *child_p);
static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static void rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from);
static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
+static void rt_verify_node(rt_node_ptr node);
+
+/* Decode and encode function of rt_pointer */
+static inline rt_node *
+rt_pointer_decode(rt_pointer encoded)
+{
+ return (rt_node *) encoded;
+}
+
+static inline rt_pointer
+rt_pointer_encode(rt_node *decoded)
+{
+ return (rt_pointer) decoded;
+}
+
+/* Return an rt_node_ptr created from the given encoded pointer */
+static inline rt_node_ptr
+rt_node_ptr_encoded(rt_pointer encoded)
+{
+ return (rt_node_ptr) {
+ .encoded = encoded,
+ .decoded = rt_pointer_decode(encoded)
+ };
+}
+
+static inline bool
+rt_node_ptr_eq(rt_node_ptr *a, rt_node_ptr *b)
+{
+ return (a->decoded == b->decoded) && (a->encoded == b->encoded);
+}
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
@@ -550,10 +616,10 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_shift(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_pointer) * (count - idx));
}
static inline void
@@ -565,10 +631,10 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_delete(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_pointer) * (count - idx - 1));
}
static inline void
@@ -580,15 +646,15 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children, int count)
+chunk_children_array_copy(uint8 *src_chunks, rt_pointer *src_children,
+ uint8 *dst_chunks, rt_pointer *dst_children, int count)
{
/* For better code generation */
if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
pg_unreachable();
memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
- memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+ memcpy(dst_children, src_children, sizeof(rt_pointer) * count);
}
static inline void
@@ -616,28 +682,28 @@ node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[slot] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[slot]);
}
static inline bool
node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
}
-static inline rt_node *
+static inline rt_pointer
node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline uint64
node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((rt_node_base_128 *) node)->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -645,7 +711,7 @@ node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
static void
node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
}
@@ -654,7 +720,7 @@ node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
}
@@ -665,7 +731,7 @@ node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
{
int slotpos = 0;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
while (node_inner_128_is_slot_used(node, slotpos))
slotpos++;
@@ -677,7 +743,7 @@ node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
/* We iterate over the isset bitmap per byte then check each bit */
for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
@@ -695,11 +761,11 @@ node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
}
static inline void
-node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_pointer child)
{
int slotpos;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
/* find unused slot */
slotpos = node_inner_128_find_unused_slot(node, chunk);
@@ -714,7 +780,7 @@ node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
/* find unused slot */
slotpos = node_leaf_128_find_unused_slot(node, chunk);
@@ -726,16 +792,16 @@ node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
/* Update the child corresponding to 'chunk' to 'child' */
static inline void
-node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[node->base.slot_idxs[chunk]] = child;
}
static inline void
node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->values[node->base.slot_idxs[chunk]] = value;
}
@@ -745,21 +811,21 @@ node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
static inline bool
node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[chunk]);
}
static inline bool
node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
}
-static inline rt_node *
+static inline rt_pointer
node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(node_inner_256_is_chunk_used(node, chunk));
return node->children[chunk];
}
@@ -767,16 +833,16 @@ node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
static inline uint64
node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(node_leaf_256_is_chunk_used(node, chunk));
return node->values[chunk];
}
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -784,7 +850,7 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
static inline void
node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
node->values[chunk] = value;
}
@@ -793,14 +859,14 @@ node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
static inline void
node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = InvalidRTPointer;
}
static inline void
node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
}
@@ -835,37 +901,37 @@ static void
rt_new_root(radix_tree *tree, uint64 key)
{
int shift = key_get_shift(key);
- rt_node *node;
+ rt_node_ptr node;
- node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
- shift > 0);
+ node = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, shift > 0);
tree->max_val = shift_get_max_val(shift);
- tree->root = node;
+ tree->root = node.encoded;
}
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
+static rt_node_ptr
rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
- rt_node_kind_info[kind].inner_size);
+ newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
else
- newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
- rt_node_kind_info[kind].leaf_size);
+ newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
- newnode->kind = kind;
- newnode->shift = shift;
- newnode->chunk = chunk;
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
+ NODE_KIND(newnode) = kind;
+ NODE_SHIFT(newnode) = shift;
+ NODE_CHUNK(newnode) = chunk;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_128)
{
- rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode.decoded;
memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
}
@@ -882,55 +948,56 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node *
-rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
+static rt_node_ptr
+rt_copy_node(radix_tree *tree, rt_node_ptr node, int new_kind)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
+ rt_node *n = node.decoded;
- newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
- node->shift > 0);
- newnode->count = node->count;
+ newnode = rt_alloc_node(tree, new_kind, n->shift, n->chunk, n->shift > 0);
+ NODE_COUNT(newnode) = NODE_COUNT(node);
return newnode;
}
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
- tree->root = NULL;
+ if (tree->root == node.encoded)
+ tree->root = InvalidRTPointer;
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[node->kind]--;
- Assert(tree->cnt[node->kind] >= 0);
+ tree->cnt[NODE_KIND(node)]--;
+ Assert(tree->cnt[NODE_KIND(node)] >= 0);
#endif
- pfree(node);
+ pfree(node.decoded);
}
/*
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
+ rt_node_ptr new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ Assert(NODE_CHUNK(old_child) == NODE_CHUNK(new_child));
+ Assert(NODE_SHIFT(old_child) == NODE_SHIFT(new_child));
- if (parent == old_child)
+ if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->root = new_child.encoded;
}
else
{
bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ replaced = rt_node_insert_inner(tree, InvalidRTNodePtr, parent, key,
+ new_child);
Assert(replaced);
}
@@ -945,23 +1012,26 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ rt_node *root = rt_pointer_decode(tree->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- rt_node_inner_4 *node;
+ rt_node_ptr node;
+ rt_node_inner_4 *n4;
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
- shift, 0, true);
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ node = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, true);
+ n4 = (rt_node_inner_4 *) node.decoded;
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ n4->base.n.count = 1;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
+
+ root->chunk = 0;
+ tree->root = node.encoded;
shift += RT_NODE_SPAN;
}
@@ -974,18 +1044,18 @@ rt_extend(radix_tree *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
+ rt_node_ptr node)
{
- int shift = node->shift;
+ int shift = NODE_SHIFT(node);
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ rt_node_ptr newchild;
int newshift = shift - RT_NODE_SPAN;
newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
- RT_GET_KEY_CHUNK(key, node->shift),
+ RT_GET_KEY_CHUNK(key, NODE_SHIFT(node)),
newshift > 0);
rt_node_insert_inner(tree, parent, node, key, newchild);
@@ -1006,17 +1076,18 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
+ rt_pointer *child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_node *child = NULL;
+ rt_pointer child;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1034,7 +1105,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1050,7 +1121,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_128:
{
- rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
break;
@@ -1066,7 +1137,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, chunk))
break;
@@ -1083,7 +1154,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && child_p)
*child_p = child;
@@ -1099,17 +1170,17 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node_ptr node, uint64 key, rt_action action, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
uint64 value = 0;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1127,7 +1198,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1143,7 +1214,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_128:
{
- rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node.decoded;
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
break;
@@ -1159,7 +1230,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, chunk))
break;
@@ -1176,7 +1247,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && value_p)
*value_p = value;
@@ -1186,19 +1257,19 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(!NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1206,25 +1277,26 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n4->children[idx] = child;
+ n4->children[idx] = child.encoded;
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_inner_32 *new32;
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) new.decoded;
+
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children,
n4->base.n.count);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1237,14 +1309,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
+ n4->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1252,24 +1324,25 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n32->children[idx] = child;
+ n32->children[idx] = child.encoded;
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_inner_128 *new128;
/* grow node from 32 to 128 */
- new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_128);
+ new128 = (rt_node_inner_128 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
- node = (rt_node *) new128;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1281,31 +1354,33 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
+ n32->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_128:
{
- rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
int cnt = 0;
if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
{
/* found the existing chunk */
chunk_exists = true;
- node_inner_128_update(n128, chunk, child);
+ node_inner_128_update(n128, chunk, child.encoded);
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_inner_256 *new256;
/* grow node from 128 to 256 */
- new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1315,33 +1390,32 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
cnt++;
}
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
- node_inner_128_insert(n128, chunk, child);
+ node_inner_128_insert(n128, chunk, child.encoded);
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(node));
- node_inner_256_set(n256, chunk, child);
+ node_inner_256_set(n256, chunk, child.encoded);
break;
}
}
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1354,19 +1428,19 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1378,21 +1452,22 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) new.decoded;
+
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values,
n4->base.n.count);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1412,7 +1487,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1424,20 +1499,21 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_leaf_128 *new128;
/* grow node from 32 to 128 */
- new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_128);
+ new128 = (rt_node_leaf_128 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
- node = (rt_node *) new128;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1456,7 +1532,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_128:
{
- rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node.decoded;
int cnt = 0;
if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
@@ -1467,13 +1543,15 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
break;
}
- if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ if (unlikely(!NODE_HAS_FREE_SLOT(node)))
{
+ rt_node_ptr new;
rt_node_leaf_256 *new256;
/* grow node from 128 to 256 */
- new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ new = rt_copy_node(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_leaf_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1483,10 +1561,9 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
cnt++;
}
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1497,10 +1574,10 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(node));
node_leaf_256_set(n256, chunk, value);
break;
@@ -1509,7 +1586,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1533,7 +1610,7 @@ rt_create(MemoryContext ctx)
tree = palloc(sizeof(radix_tree));
tree->context = ctx;
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
tree->num_keys = 0;
@@ -1582,26 +1659,23 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- rt_node *node;
- rt_node *parent;
+ rt_node_ptr node;
+ rt_node_ptr parent;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
if (key > tree->max_val)
rt_extend(tree, key);
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = parent = tree->root;
-
/* Descend the tree until a leaf node */
+ node = parent = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1613,7 +1687,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1634,21 +1708,21 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
bool
rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RTPointerIsValid(tree->root) || key > tree->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1656,7 +1730,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1670,8 +1744,8 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
bool
rt_delete(radix_tree *tree, uint64 key)
{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ rt_node_ptr node;
+ rt_node_ptr stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1683,12 +1757,12 @@ rt_delete(radix_tree *tree, uint64 key)
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
{
- rt_node *child;
+ rt_pointer child;
/* Push the current node to the stack */
stack[++level] = node;
@@ -1696,7 +1770,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1745,7 +1819,7 @@ rt_delete(radix_tree *tree, uint64 key)
*/
if (level == 0)
{
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
}
@@ -1757,6 +1831,7 @@ rt_iter *
rt_begin_iterate(radix_tree *tree)
{
MemoryContext old_ctx;
+ rt_node_ptr root;
rt_iter *iter;
int top_level;
@@ -1766,17 +1841,18 @@ rt_begin_iterate(radix_tree *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree)
+ if (!RTPointerIsValid(iter->tree))
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = rt_node_ptr_encoded(iter->tree->root);
+ top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ rt_update_iter_stack(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1787,14 +1863,15 @@ rt_begin_iterate(radix_tree *tree)
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
{
int level = from;
- rt_node *node = from_node;
+ rt_node_ptr node = from_node;
for (;;)
{
rt_node_iter *node_iter = &(iter->stack[level--]);
+ bool found PG_USED_FOR_ASSERTS_ONLY;
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1804,10 +1881,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
return;
/* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
+ found = rt_node_inner_iterate_next(iter, node_iter, &node);
/* We must find the first children in the node */
- Assert(node);
+ Assert(found);
}
}
@@ -1824,7 +1901,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- rt_node *child = NULL;
+ rt_node_ptr child = InvalidRTNodePtr;
uint64 value;
int level;
bool found;
@@ -1845,14 +1922,12 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
*/
for (level = 1; level <= iter->stack_len; level++)
{
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
-
- if (child)
+ if (rt_node_inner_iterate_next(iter, &(iter->stack[level]), &child))
break;
}
/* the iteration finished */
- if (!child)
+ if (!RTNodePtrIsValid(child))
return false;
/*
@@ -1884,18 +1959,19 @@ rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
* Advance the slot in the inner node. Return the child if exists, otherwise
* null.
*/
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+static inline bool
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *child_p)
{
- rt_node *child = NULL;
+ rt_node_ptr node = node_iter->node;
+ rt_pointer child;
bool found = false;
uint8 key_chunk;
- switch (node_iter->node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -1908,7 +1984,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -1921,7 +1997,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_128:
{
- rt_node_inner_128 *n128 = (rt_node_inner_128 *) node_iter->node;
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -1941,7 +2017,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -1962,9 +2038,12 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
if (found)
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ {
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
+ *child_p = rt_node_ptr_encoded(child);
+ }
- return child;
+ return found;
}
/*
@@ -1972,19 +2051,18 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
* is set to value_p, otherwise return false.
*/
static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_p)
{
- rt_node *node = node_iter->node;
+ rt_node_ptr node = node_iter->node;
bool found = false;
uint64 value;
uint8 key_chunk;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -1997,7 +2075,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2010,7 +2088,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_128:
{
- rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node_iter->node;
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2030,7 +2108,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2052,7 +2130,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
if (found)
{
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
*value_p = value;
}
@@ -2089,16 +2167,16 @@ rt_memory_usage(radix_tree *tree)
* Verify the radix tree node.
*/
static void
-rt_verify_node(rt_node *node)
+rt_verify_node(rt_node_ptr node)
{
#ifdef USE_ASSERT_CHECKING
- Assert(node->count >= 0);
+ Assert(NODE_COUNT(node) >= 0);
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node.decoded;
for (int i = 1; i < n4->n.count; i++)
Assert(n4->chunks[i - 1] < n4->chunks[i]);
@@ -2107,7 +2185,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_32:
{
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node.decoded;
for (int i = 1; i < n32->n.count; i++)
Assert(n32->chunks[i - 1] < n32->chunks[i]);
@@ -2116,7 +2194,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_128:
{
- rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2126,10 +2204,10 @@ rt_verify_node(rt_node *node)
/* Check if the corresponding slot is used */
if (NODE_IS_LEAF(node))
- Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) node,
+ Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) n128,
n128->slot_idxs[i]));
else
- Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) node,
+ Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) n128,
n128->slot_idxs[i]));
cnt++;
@@ -2142,7 +2220,7 @@ rt_verify_node(rt_node *node)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
@@ -2163,9 +2241,11 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
+ rt_node_ptr root = rt_node_ptr_encoded(tree->root);
+
ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
+ NODE_SHIFT(root) / RT_NODE_SPAN,
tree->cnt[0],
tree->cnt[1],
tree->cnt[2],
@@ -2173,42 +2253,44 @@ rt_stats(radix_tree *tree)
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(rt_node_ptr node, int level, bool recurse)
{
+ rt_node *n = node.decoded;
char space[128] = {0};
fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_128) ? 128 : 256,
- node->count, node->shift, node->chunk);
+ (NODE_KIND(node) == RT_NODE_KIND_4) ? 4 :
+ (NODE_KIND(node) == RT_NODE_KIND_32) ? 32 :
+ (NODE_KIND(node) == RT_NODE_KIND_128) ? 128 : 256,
+ n->count, n->shift, n->chunk);
if (level > 0)
sprintf(space, "%*c", level * 4, ' ');
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
space, n4->base.chunks[i], n4->values[i]);
}
else
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2217,25 +2299,26 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_32:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
space, n32->base.chunks[i], n32->values[i]);
}
else
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n32->base.chunks[i]);
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2245,7 +2328,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_128:
{
- rt_node_base_128 *b128 = (rt_node_base_128 *) node;
+ rt_node_base_128 *b128 = (rt_node_base_128 *) node.decoded;
fprintf(stderr, "slot_idxs ");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2257,7 +2340,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
+ rt_node_leaf_128 *n = (rt_node_leaf_128 *) node.decoded;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < 16; i++)
@@ -2287,7 +2370,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_128_get_child(n128, i),
+ rt_dump_node(rt_node_ptr_encoded(node_inner_128_get_child(n128, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2301,7 +2384,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, i))
continue;
@@ -2311,7 +2394,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
else
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, i))
continue;
@@ -2320,8 +2403,8 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
+ rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2334,14 +2417,14 @@ rt_dump_node(rt_node *node, int level, bool recurse)
void
rt_dump_search(radix_tree *tree, uint64 key)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
int level = 0;
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
elog(NOTICE, "tree is empty");
return;
@@ -2354,11 +2437,11 @@ rt_dump_search(radix_tree *tree, uint64 key)
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
rt_dump_node(node, level, false);
@@ -2375,7 +2458,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2384,6 +2467,8 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
+ rt_node_ptr root;
+
for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
rt_node_kind_info[i].name,
@@ -2393,12 +2478,13 @@ rt_dump(radix_tree *tree)
rt_node_kind_info[i].leaf_blocksize);
fprintf(stderr, "max_val = %lu\n", tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ root = rt_node_ptr_encoded(tree->root);
+ rt_dump_node(root, 0, true);
}
#endif
--
2.31.1
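To make the rt_node_ptr changes above easier to review: the idea is that every node reference now travels as a small handle holding both the encoded rt_pointer (what is stored inside the tree) and the decoded local address (what the code actually dereferences). A minimal standalone sketch of that pattern, with purely illustrative names and not code from the patch, looks like this:

/* Illustrative sketch only -- not code from the patch. */
#include <stdint.h>

typedef uintptr_t demo_pointer;		/* value stored inside the tree */

typedef struct demo_node
{
	uint8_t		shift;
	uint8_t		count;
} demo_node;

/* Handle pairing the stored representation with a usable local address */
typedef struct demo_node_handle
{
	demo_pointer encoded;
	demo_node  *decoded;
} demo_node_handle;

static inline demo_node_handle
demo_handle_from_encoded(demo_pointer encoded)
{
	demo_node_handle h;

	/* In a purely local tree the two representations coincide. */
	h.encoded = encoded;
	h.decoded = (demo_node *) encoded;
	return h;
}

Keeping both forms in one struct is what lets the later patches swap the decode step for a dsa_get_address() call without touching most callers.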
Attachment: v10-0005-PoC-tag-the-node-kind-to-rt_pointer.patch (application/octet-stream)
From 180ee3a0691bd1c7986f41dfec51673891e5cc06 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 17 Nov 2022 11:16:06 +0900
Subject: [PATCH v10 5/7] PoC: tag the node kind to rt_pointer.
---
src/backend/lib/radixtree.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 67f4dc646e..08d580a899 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -141,6 +141,8 @@ typedef enum
typedef uintptr_t rt_pointer;
#define InvalidRTPointer ((rt_pointer) 0)
#define RTPointerIsValid(x) (((rt_pointer) (x)) != InvalidRTPointer)
+#define RTPointerTagKind(x, k) ((rt_pointer) (x) | ((k) & RT_POINTER_KIND_MASK))
+#define RTPointerUnTagKind(x) ((rt_pointer) (x) & ~RT_POINTER_KIND_MASK)
/* Common type for all nodes types */
typedef struct rt_node
@@ -159,8 +161,10 @@ typedef struct rt_node
uint8 shift;
uint8 chunk;
- /* Size kind of the node */
- uint8 kind;
+ /*
+ * The node kind is tagged into the rt_pointer, see the comments of
+ * rt_pointer for details.
+ */
} rt_node;
#define RT_NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
@@ -303,7 +307,7 @@ typedef struct rt_node_ptr
#define NODE_RAW(n) (((rt_node_ptr) (n)).decoded)
#define NODE_IS_LEAF(n) (NODE_RAW(n)->shift == 0)
#define NODE_IS_EMPTY(n) (NODE_COUNT(n) == 0)
-#define NODE_KIND(n) (NODE_RAW(n)->kind)
+#define NODE_KIND(n) ((uint8) (((rt_node_ptr) (n)).encoded & RT_POINTER_KIND_MASK))
#define NODE_COUNT(n) (NODE_RAW(n)->count)
#define NODE_SHIFT(n) (NODE_RAW(n)->shift)
#define NODE_CHUNK(n) (NODE_RAW(n)->chunk)
@@ -444,13 +448,13 @@ static void rt_verify_node(rt_node_ptr node);
static inline rt_node *
rt_pointer_decode(rt_pointer encoded)
{
- return (rt_node *) encoded;
+ return (rt_node *) RTPointerUnTagKind(encoded);
}
static inline rt_pointer
-rt_pointer_encode(rt_node *decoded)
+rt_pointer_encode(rt_node *decoded, uint8 kind)
{
- return (rt_pointer) decoded;
+ return (rt_pointer) RTPointerTagKind(decoded, kind);
}
/* Return a rt_pointer created from the given encoded pointer */
@@ -923,8 +927,7 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
rt_node_kind_info[kind].leaf_size);
- newnode.encoded = rt_pointer_encode(newnode.decoded);
- NODE_KIND(newnode) = kind;
+ newnode.encoded = rt_pointer_encode(newnode.decoded, kind);
NODE_SHIFT(newnode) = shift;
NODE_CHUNK(newnode) = chunk;
--
2.31.1
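The 0005 patch relies on node allocations being aligned well enough that the node kind can live in the otherwise-unused low bits of rt_pointer. A tiny self-contained sketch of that tagging scheme follows; the names and the 3-bit mask are my illustration (the patch's actual RT_POINTER_KIND_MASK is defined elsewhere):

/* Illustrative only; assumes allocations are at least 8-byte aligned,
 * leaving the 3 low bits of a pointer free for a small tag. */
#include <assert.h>
#include <stdint.h>

#define DEMO_KIND_MASK ((uintptr_t) 0x7)

static inline uintptr_t
demo_tag_kind(void *node, uint8_t kind)
{
	assert(((uintptr_t) node & DEMO_KIND_MASK) == 0);
	return (uintptr_t) node | (kind & DEMO_KIND_MASK);
}

static inline void *
demo_untag_kind(uintptr_t tagged)
{
	return (void *) (tagged & ~DEMO_KIND_MASK);
}

static inline uint8_t
demo_get_kind(uintptr_t tagged)
{
	return (uint8_t) (tagged & DEMO_KIND_MASK);
}

The benefit is that NODE_KIND() can be answered from the pointer itself without dereferencing the node, which also removes the need for the 'kind' byte in the node header, as the hunk above shows.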
Attachment: v10-0007-PoC-lazy-vacuum-integration.patch (application/octet-stream)
From 2e6cc9188b06ec7ed548fe556bc1402bf1b88976 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 4 Nov 2022 14:14:42 +0900
Subject: [PATCH v10 7/7] PoC: lazy vacuum integration.
The patch includes:
* Introducing a new module called TIDStore
* Lazy vacuum and parallel vacuum integration.
TODOs:
* radix tree needs to have the reset functionality.
* should not allow TIDStore to grow beyond the memory limit.
* change the progress statistics of pg_stat_progress_vacuum.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 280 ++++++++++++++++++++++++++
src/backend/access/heap/vacuumlazy.c | 160 +++++----------
src/backend/commands/vacuum.c | 76 +------
src/backend/commands/vacuumparallel.c | 63 +++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 55 +++++
src/include/commands/vacuum.h | 24 +--
src/include/storage/lwlock.h | 1 +
10 files changed, 437 insertions(+), 226 deletions(-)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index 857beaa32d..76265974b1 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..50ec800fd6
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,280 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * TID (ItemPointer) storage implementation.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* XXX: should be configurable for non-heap AMs */
+#define TIDSTORE_OFFSET_NBITS 11 /* pg_ceil_log2_32(MaxHeapTuplesPerPage) */
+
+#define TIDSTORE_VALUE_NBITS 6 /* log(sizeof(uint64) * BITS_PER_BYTE, 2) */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+struct TIDStore
+{
+ /* main storage for TID */
+ radix_tree *tree;
+
+ /* # of tids in TIDStore */
+ int num_tids;
+
+ /* DSA area and handle for shared TIDStore */
+ rt_handle handle;
+ dsa_area *area;
+};
+
+static void tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TIDStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TIDStore *
+tidstore_create(dsa_area *area)
+{
+ TIDStore *ts;
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+ ts->area = area;
+
+ if (area != NULL)
+ ts->handle = rt_get_handle(ts->tree);
+
+ return ts;
+}
+
+/* Attach to the shared TIDStore using a handle */
+TIDStore *
+tidstore_attach(dsa_area *area, rt_handle handle)
+{
+ TIDStore *ts;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ ts = palloc0(sizeof(TIDStore));
+ ts->tree = rt_attach(area, handle);
+
+ return ts;
+}
+
+/*
+ * Detach from a TIDStore. This detaches from the radix tree and frees the
+ * backend-local resources.
+ */
+void
+tidstore_detach(TIDStore *ts)
+{
+ rt_detach(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_free(TIDStore *ts)
+{
+ rt_free(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_reset(TIDStore *ts)
+{
+ dsa_area *area = ts->area;
+
+ /* Reset the statistics */
+ ts->num_tids = 0;
+
+ /* Recreate radix tree storage */
+ rt_free(ts->tree);
+ ts->tree = rt_create(CurrentMemoryContext, area);
+}
+
+/* Add TIDs to TIDStore */
+void
+tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ ts->num_tids++;
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+}
+
+/* Return true if the given TID is present in TIDStore */
+bool
+tidstore_lookup_tid(TIDStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = rt_search(ts->tree, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+TIDStoreIter *
+tidstore_begin_iterate(TIDStore *ts)
+{
+ TIDStoreIter *iter;
+
+ iter = palloc0(sizeof(TIDStoreIter));
+ iter->ts = ts;
+ iter->tree_iter = rt_begin_iterate(ts->tree);
+ iter->blkno = InvalidBlockNumber;
+
+ return iter;
+}
+
+bool
+tidstore_iterate_next(TIDStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+
+ if (iter->finished)
+ return false;
+
+ if (BlockNumberIsValid(iter->blkno))
+ {
+ iter->num_offsets = 0;
+ tidstore_iter_collect_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (rt_iterate_next(iter->tree_iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(iter->blkno) && iter->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return true;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_collect_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return true;
+}
+
+uint64
+tidstore_num_tids(TIDStore *ts)
+{
+ return ts->num_tids;
+}
+
+uint64
+tidstore_memory_usage(TIDStore *ts)
+{
+ return (uint64) sizeof(TIDStore) + rt_memory_usage(ts->tree);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TIDStore
+ */
+tidstore_handle
+tidstore_get_handle(TIDStore *ts)
+{
+ return rt_get_handle(ts->tree);
+}
+
+/* Extract TIDs from key-value pair */
+static void
+tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val)
+{
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ iter->offsets[iter->num_offsets++] = off;
+ }
+
+ iter->blkno = KEY_GET_BLKNO(key);
+}
+
+/* Encode a TID into a radix tree key and the bit position within its value */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 834ab83a0e..cda405dd99 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -144,6 +145,8 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+ int max_bytes;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -194,7 +197,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TIDStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -265,8 +268,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -397,6 +401,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indname = NULL;
vacrel->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
vacrel->verbose = verbose;
+ vacrel->max_bytes = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
errcallback.callback = vacuum_error_callback;
errcallback.arg = vacrel;
errcallback.previous = error_context_stack;
@@ -858,7 +865,7 @@ lazy_scan_heap(LVRelState *vacrel)
next_unskippable_block,
next_failsafe_block = 0,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TIDStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
@@ -872,7 +879,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = vacrel->max_bytes; /* XXX: should use # of tids */
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -942,8 +949,8 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ /* XXX: should not allow tidstore to grow beyond max_bytes */
+ if (tidstore_memory_usage(vacrel->dead_items) > vacrel->max_bytes)
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1075,11 +1082,17 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TIDStoreIter *iter;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ pfree(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1116,7 +1129,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1269,7 +1282,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1868,25 +1881,16 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
Assert(!prunestate->all_visible);
Assert(prunestate->has_lpdead_items);
vacrel->lpdead_item_pages++;
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
}
/* Finally, add page-local counts to whole-VACUUM counts */
@@ -2093,8 +2097,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2103,17 +2106,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2162,7 +2158,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2191,7 +2187,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2218,8 +2214,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2264,7 +2260,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ /* tidstore_reset(vacrel->dead_items); */
}
/*
@@ -2336,7 +2332,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2373,10 +2369,10 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TIDStoreIter *iter;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2393,8 +2389,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while (tidstore_iterate_next(iter))
{
BlockNumber tblk;
Buffer buf;
@@ -2403,12 +2399,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = iter->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2432,9 +2429,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
@@ -2456,11 +2452,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2479,16 +2474,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2568,7 +2558,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3070,46 +3059,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3120,12 +3069,6 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
-
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
* be used for an index, so we invoke parallelism only if there are at
@@ -3151,7 +3094,6 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3164,11 +3106,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 3c8ea21475..effb72cdd6 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2295,16 +2294,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TIDStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2335,18 +2334,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2357,60 +2344,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TIDStore *dead_items = (TIDStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26d796e52..070503f662 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TIDStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,22 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +289,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +356,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +375,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +384,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +441,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_free(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +452,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TIDStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +950,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +996,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1045,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a5ad36ca78..2fb30fe2e7 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -183,6 +183,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..40b8021f9b
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,55 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * TID storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TIDStore TIDStore;
+
+typedef struct TIDStoreIter
+{
+ TIDStore *ts;
+
+ rt_iter *tree_iter;
+
+ bool finished;
+
+ uint64 next_key;
+ uint64 next_val;
+
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually far larger than needed */
+ int num_offsets;
+} TIDStoreIter;
+
+extern TIDStore *tidstore_create(dsa_area *dsa);
+extern TIDStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TIDStore *ts);
+extern void tidstore_free(TIDStore *ts);
+extern void tidstore_reset(TIDStore *ts);
+extern void tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TIDStore *ts, ItemPointer tid);
+extern TIDStoreIter * tidstore_begin_iterate(TIDStore *ts);
+extern bool tidstore_iterate_next(TIDStoreIter *iter);
+extern uint64 tidstore_num_tids(TIDStore *ts);
+extern uint64 tidstore_memory_usage(TIDStore *ts);
+extern tidstore_handle tidstore_get_handle(TIDStore *ts);
+
+#endif /* TIDSTORE_H */
+
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 5d816ba7f4..d221528f16 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -235,21 +236,6 @@ typedef struct VacuumParams
int nworkers;
} VacuumParams;
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -306,18 +292,16 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TIDStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TIDStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index a494cb598f..88e35254d1 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -201,6 +201,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
--
2.31.1
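As an aid for reviewing tidstore.c: each TID is split into a radix tree key and a bit position inside the 64-bit value, using the same constants as the patch (11 bits for the offset number, with the low 6 bits selecting a bit in the value). The following standalone sketch is illustrative only, not patch code, and just shows the arithmetic:

#include <stdint.h>
#include <stdio.h>

#define DEMO_OFFSET_NBITS 11	/* bits reserved for the heap offset number */
#define DEMO_VALUE_NBITS  6		/* bits selecting a position in a 64-bit value */

/* Split (block, offset) into a radix tree key and a bit position. */
static uint64_t
demo_tid_to_key_off(uint32_t blkno, uint16_t offnum, uint32_t *bitpos)
{
	uint64_t	tid_i = ((uint64_t) blkno << DEMO_OFFSET_NBITS) | offnum;

	*bitpos = (uint32_t) (tid_i & ((1 << DEMO_VALUE_NBITS) - 1));
	return tid_i >> DEMO_VALUE_NBITS;
}

int
main(void)
{
	uint32_t	bitpos;
	uint64_t	key = demo_tid_to_key_off(10, 3, &bitpos);

	/* For block 10, offset 3 this prints key = 320, bit = 3. */
	printf("key = %llu, bit = %u\n", (unsigned long long) key, bitpos);
	return 0;
}

Because all dead offsets of one heap block land in a handful of adjacent keys, tidstore_add_tids() can batch them into a single rt_set() per key, and tidstore_lookup_tid() needs only one rt_search() plus a bit test.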
Attachment: v10-0006-PoC-DSA-support-for-radix-tree.patch (application/octet-stream)
From b85513ab0f8654df36aa913f4b29b626e652943f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 27 Oct 2022 14:02:00 +0900
Subject: [PATCH v10 6/7] PoC: DSA support for radix tree.
---
.../bench_radix_tree--1.0.sql | 2 +
contrib/bench_radix_tree/bench_radix_tree.c | 12 +-
src/backend/lib/radixtree.c | 483 +++++++++++++-----
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 8 +-
src/include/utils/dsa.h | 1 +
.../expected/test_radixtree.out | 17 +
.../modules/test_radixtree/test_radixtree.c | 100 ++--
8 files changed, 482 insertions(+), 153 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index e0205b364e..b5f731f329 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -7,6 +7,7 @@ create function bench_shuffle_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
@@ -23,6 +24,7 @@ create function bench_seq_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 70ca989118..225a1b3bb1 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -15,6 +15,7 @@
#include "lib/radixtree.h"
#include <math.h>
#include "miscadmin.h"
+#include "storage/lwlock.h"
#include "utils/timestamp.h"
PG_MODULE_MAGIC;
@@ -150,7 +151,9 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
BlockNumber minblk = PG_GETARG_INT32(0);
BlockNumber maxblk = PG_GETARG_INT32(1);
bool random_block = PG_GETARG_BOOL(2);
+ bool shared = PG_GETARG_BOOL(3);
radix_tree *rt = NULL;
+ dsa_area *dsa = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;
@@ -172,8 +175,11 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+ if (shared)
+ dsa = dsa_create(LWLockNewTrancheId());
+
/* measure the load time of the radix tree */
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, dsa);
start_time = GetCurrentTimestamp();
for (int i = 0; i < ntids; i++)
{
@@ -324,7 +330,7 @@ bench_load_random_int(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
pg_prng_seed(&state, 0);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
@@ -450,7 +456,7 @@ bench_fixed_height_search(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 08d580a899..1f2bb95e24 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -22,6 +22,15 @@
* choose it to avoid an additional pointer traversal. It is the reason this code
* currently does not support variable-length keys.
*
+ * If a DSA area is specified when calling rt_create(), the radix tree is created
+ * in that DSA area so that multiple processes can access it simultaneously. The
+ * process that created the shared radix tree needs to pass both the DSA area
+ * given to rt_create() and the handle of the radix tree, fetched by
+ * rt_get_handle(), to other processes so that they can attach via rt_attach().
+ *
+ * XXX: the shared radix tree is still in a PoC state as it doesn't have any
+ * locking support. Also, it supports only single-process iteration.
+ *
* XXX: Most functions in this file have two variants for inner nodes and leaf
* nodes, therefore there are duplication codes. While this sometimes makes the
* code maintenance tricky, this reduces branch prediction misses when judging
@@ -34,6 +43,9 @@
*
* rt_create - Create a new, empty radix tree
* rt_free - Free the radix tree
+ * rt_attach - Attach to the radix tree
+ * rt_detach - Detach from the radix tree
+ * rt_get_handle - Return the handle of the radix tree
* rt_search - Search a key-value pair
* rt_set - Set a key-value pair
* rt_delete - Delete a key-value pair
@@ -64,6 +76,7 @@
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
/* The number of bits encoded in one tree level */
@@ -384,6 +397,11 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: Currently we allow only one process to iterate at a time. Therefore, rt_node_iter
+ * holds local pointers to nodes, rather than rt_node_ptr.
+ * We need either a safeguard that disallows other processes from beginning an iteration
+ * while one is in progress, or support for multiple processes iterating concurrently.
*/
typedef struct rt_node_iter
{
@@ -403,23 +421,43 @@ struct rt_iter
uint64 key;
};
-/* A radix tree with nodes */
-struct radix_tree
+/* A magic value used to identify our radix tree */
+#define RADIXTREE_MAGIC 0x54A48167
+
+/* Control information for a radix tree */
+typedef struct radix_tree_control
{
- MemoryContext context;
+ rt_handle handle;
+ uint32 magic;
+ /* Root node */
rt_pointer root;
- uint64 max_val;
- uint64 num_keys;
- MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
- MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+ pg_atomic_uint64 max_val;
+ pg_atomic_uint64 num_keys;
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_NODE_KIND_COUNT];
#endif
+} radix_tree_control;
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ /* control object in either backend-local memory or DSA */
+ radix_tree_control *ctl;
+
+ /* used only when the radix tree is shared */
+ dsa_area *area;
+
+ /* used only when the radix tree is private */
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
};
+#define RadixTreeIsShared(rt) ((rt)->area != NULL)
static void rt_new_root(radix_tree *tree, uint64 key);
static rt_node_ptr rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
@@ -446,24 +484,31 @@ static void rt_verify_node(rt_node_ptr node);
/* Decode and encode function of rt_pointer */
static inline rt_node *
-rt_pointer_decode(rt_pointer encoded)
+rt_pointer_decode(radix_tree *tree, rt_pointer encoded)
{
- return (rt_node *) RTPointerUnTagKind(encoded);
+ encoded = RTPointerUnTagKind(encoded);
+
+ if (RadixTreeIsShared(tree))
+ return (rt_node *) dsa_get_address(tree->area, encoded);
+ else
+ return (rt_node *) encoded;
}
static inline rt_pointer
-rt_pointer_encode(rt_node *decoded, uint8 kind)
+rt_pointer_encode(rt_pointer decoded, uint8 kind)
{
+ Assert((decoded & RT_POINTER_KIND_MASK) == 0);
+
return (rt_pointer) RTPointerTagKind(decoded, kind);
}
/* Return a rt_pointer created from the given encoded pointer */
static inline rt_node_ptr
-rt_node_ptr_encoded(rt_pointer encoded)
+rt_node_ptr_encoded(radix_tree *tree, rt_pointer encoded)
{
return (rt_node_ptr) {
.encoded = encoded,
- .decoded = rt_pointer_decode(encoded)
+ .decoded = rt_pointer_decode(tree, encoded)
};
}
@@ -908,8 +953,8 @@ rt_new_root(radix_tree *tree, uint64 key)
rt_node_ptr node;
node = rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0, shift > 0);
- tree->max_val = shift_get_max_val(shift);
- tree->root = node.encoded;
+ pg_atomic_write_u64(&tree->ctl->max_val, shift_get_max_val(shift));
+ tree->ctl->root = node.encoded;
}
/*
@@ -918,16 +963,35 @@ rt_new_root(radix_tree *tree, uint64 key)
static rt_node_ptr
rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
{
- rt_node_ptr newnode;
+ rt_node_ptr newnode;
+
+ if (tree->area != NULL)
+ {
+ dsa_pointer dp;
+
+ if (inner)
+ dp = dsa_allocate0(tree->area, rt_node_kind_info[kind].inner_size);
+ else
+ dp = dsa_allocate0(tree->area, rt_node_kind_info[kind].leaf_size);
- if (inner)
- newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
- rt_node_kind_info[kind].inner_size);
+ newnode.encoded = rt_pointer_encode((rt_pointer) dp, kind);
+ newnode.decoded = (rt_node *) dsa_get_address(tree->area, dp);
+ }
else
- newnode.decoded = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
- rt_node_kind_info[kind].leaf_size);
+ {
+ rt_node *new;
+
+ if (inner)
+ new = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ new = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+ newnode.encoded = rt_pointer_encode((rt_pointer) new, kind);
+ newnode.decoded = new;
+ }
- newnode.encoded = rt_pointer_encode(newnode.decoded, kind);
NODE_SHIFT(newnode) = shift;
NODE_CHUNK(newnode) = chunk;
@@ -941,7 +1005,7 @@ rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[kind]++;
+ tree->ctl->cnt[kind]++;
#endif
return newnode;
@@ -968,16 +1032,19 @@ static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node.encoded)
- tree->root = InvalidRTPointer;
+ if (tree->ctl->root == node.encoded)
+ tree->ctl->root = InvalidRTPointer;
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[NODE_KIND(node)]--;
- Assert(tree->cnt[NODE_KIND(node)] >= 0);
+ tree->ctl->cnt[NODE_KIND(node)]--;
+ Assert(tree->ctl->cnt[NODE_KIND(node)] >= 0);
#endif
- pfree(node.decoded);
+ if (RadixTreeIsShared(tree))
+ dsa_free(tree->area, (dsa_pointer) RTPointerUnTagKind(node.encoded));
+ else
+ pfree(node.decoded);
}
/*
@@ -993,7 +1060,7 @@ rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child.encoded;
+ tree->ctl->root = new_child.encoded;
}
else
{
@@ -1015,7 +1082,7 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
@@ -1031,15 +1098,15 @@ rt_extend(radix_tree *tree, uint64 key)
n4->base.n.count = 1;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
root->chunk = 0;
- tree->root = node.encoded;
+ tree->ctl->root = node.encoded;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ pg_atomic_write_u64(&tree->ctl->max_val, shift_get_max_val(target_shift));
}
/*
@@ -1068,7 +1135,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
}
rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
+ pg_atomic_add_fetch_u64(&tree->ctl->num_keys, 1);
}
/*
@@ -1079,8 +1146,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
- rt_pointer *child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action, rt_pointer *child_p)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
@@ -1115,6 +1181,7 @@ rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
break;
found = true;
+
if (action == RT_ACTION_FIND)
child = n32->children[idx];
else /* RT_ACTION_DELETE */
@@ -1604,33 +1671,50 @@ rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
* Create the radix tree in the given memory context and return it.
*/
radix_tree *
-rt_create(MemoryContext ctx)
+rt_create(MemoryContext ctx, dsa_area *area)
{
radix_tree *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->context = ctx;
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+
+ tree->area = area;
+ dp = dsa_allocate0(area, sizeof(radix_tree_control));
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
+ tree->ctl->handle = (rt_handle) dp;
+ }
+ else
+ {
+ tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
+ tree->ctl->handle = InvalidDsaPointer;
+ }
+
+ tree->ctl->magic = RADIXTREE_MAGIC;
+ tree->ctl->root = InvalidRTPointer;
+ pg_atomic_init_u64(&tree->ctl->max_val, 0);
+ pg_atomic_init_u64(&tree->ctl->num_keys, 0);
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ if (area == NULL)
{
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].leaf_blocksize,
- rt_node_kind_info[i].leaf_size);
-#ifdef RT_DEBUG
- tree->cnt[i] = 0;
-#endif
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+ }
}
MemoryContextSwitchTo(old_ctx);
@@ -1638,16 +1722,160 @@ rt_create(MemoryContext ctx)
return tree;
}
+/*
+ * Get a handle that can be used by other processes to attach to this radix
+ * tree.
+ */
+dsa_pointer
+rt_get_handle(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree->ctl->handle;
+}
+
+/*
+ * Attach to an existing radix tree using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+radix_tree *
+rt_attach(dsa_area *area, rt_handle handle)
+{
+ radix_tree *tree;
+ dsa_pointer control;
+
+ /* Allocate the backend-local object representing the radix tree */
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the local radix tree */
+ tree->area = area;
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, control);
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree;
+}
+
+/*
+ * Detach from a radix tree. This frees backend-local resources associated
+ * with the radix tree, but the radix tree will continue to exist until
+ * it is explicitly freed.
+ */
+void
+rt_detach(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ pfree(tree);
+}
+
+/*
+ * Recursively free all nodes allocated in the DSA area.
+ */
+static void
+rt_free_recurse(radix_tree *tree, rt_pointer ptr)
+{
+ rt_node_ptr node = rt_node_ptr_encoded(tree, ptr);
+
+ Assert(RadixTreeIsShared(tree));
+
+ /* The leaf node doesn't have child pointers, so free it */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->area, RTPointerUnTagKind(node.encoded));
+ return;
+ }
+
+ switch (NODE_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_128_get_child(n128, i));
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_256_get_child(n256, i));
+ }
+ break;
+ }
+ }
+
+ /* Free the inner node itself */
+ dsa_free(tree->area, RTPointerUnTagKind(node.encoded));
+}
+
/*
* Free the given radix tree.
*/
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
{
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
+ /* Free all memory used for radix tree nodes */
+ if (RTPointerIsValid(tree->ctl->root))
+ rt_free_recurse(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->area, tree->ctl->handle);
+ }
+ else
+ {
+ /* Free all memory used for radix tree nodes */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+ pfree(tree->ctl);
}
pfree(tree);
@@ -1665,16 +1893,18 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
rt_node_ptr node;
rt_node_ptr parent;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree, create the root */
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > pg_atomic_read_u64(&tree->ctl->max_val))
rt_extend(tree, key);
/* Descend the tree until a leaf node */
- node = parent = rt_node_ptr_encoded(tree->root);
+ node = parent = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
@@ -1690,7 +1920,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1698,7 +1928,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ pg_atomic_add_fetch_u64(&tree->ctl->num_keys, 1);
return updated;
}
@@ -1714,12 +1944,14 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
rt_node_ptr node;
int shift;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
Assert(value_p != NULL);
- if (!RTPointerIsValid(tree->root) || key > tree->max_val)
+ if (!RTPointerIsValid(tree->ctl->root) ||
+ key > pg_atomic_read_u64(&tree->ctl->max_val))
return false;
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
@@ -1733,7 +1965,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1753,14 +1985,17 @@ rt_delete(radix_tree *tree, uint64 key)
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (!RTPointerIsValid(tree->ctl->root) ||
+ key > pg_atomic_read_u64(&tree->ctl->max_val))
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
@@ -1773,7 +2008,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1788,7 +2023,7 @@ rt_delete(radix_tree *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ pg_atomic_sub_fetch_u64(&tree->ctl->num_keys, 1);
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1822,8 +2057,8 @@ rt_delete(radix_tree *tree, uint64 key)
*/
if (level == 0)
{
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
+ tree->ctl->root = InvalidRTPointer;
+ pg_atomic_write_u64(&tree->ctl->max_val, 0);
}
return true;
@@ -1838,6 +2073,8 @@ rt_begin_iterate(radix_tree *tree)
rt_iter *iter;
int top_level;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
old_ctx = MemoryContextSwitchTo(tree->context);
iter = (rt_iter *) palloc0(sizeof(rt_iter));
@@ -1847,7 +2084,7 @@ rt_begin_iterate(radix_tree *tree)
if (!RTPointerIsValid(iter->tree))
return iter;
- root = rt_node_ptr_encoded(iter->tree->root);
+ root = rt_node_ptr_encoded(tree, iter->tree->ctl->root);
top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
@@ -1898,6 +2135,8 @@ rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
bool
rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
{
+ Assert(!RadixTreeIsShared(iter->tree) || iter->tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree */
if (!iter->tree)
return false;
@@ -2043,7 +2282,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *
if (found)
{
rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
- *child_p = rt_node_ptr_encoded(child);
+ *child_p = rt_node_ptr_encoded(iter->tree, child);
}
return found;
@@ -2146,7 +2385,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_
uint64
rt_num_entries(radix_tree *tree)
{
- return tree->num_keys;
+ return pg_atomic_read_u64(&tree->ctl->num_keys);
}
/*
@@ -2155,12 +2394,19 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
+ total = dsa_get_total_size(tree->area);
+ else
{
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
}
return total;
@@ -2244,19 +2490,19 @@ rt_verify_node(rt_node_ptr node)
void
rt_stats(radix_tree *tree)
{
- rt_node_ptr root = rt_node_ptr_encoded(tree->root);
+ rt_node_ptr root = rt_node_ptr_encoded(tree, tree->ctl->root);
ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
- tree->num_keys,
+ pg_atomic_read_u64(&tree->ctl->num_keys),
NODE_SHIFT(root) / RT_NODE_SPAN,
- tree->cnt[0],
- tree->cnt[1],
- tree->cnt[2],
- tree->cnt[3])));
+ tree->ctl->cnt[0],
+ tree->ctl->cnt[1],
+ tree->ctl->cnt[2],
+ tree->ctl->cnt[3])));
}
static void
-rt_dump_node(rt_node_ptr node, int level, bool recurse)
+rt_dump_node(radix_tree *tree, rt_node_ptr node, int level, bool recurse)
{
rt_node *n = node.decoded;
char space[128] = {0};
@@ -2292,7 +2538,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n4->children[i]),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2320,7 +2566,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n32->children[i]),
level + 1, recurse);
}
else
@@ -2373,7 +2619,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_128_get_child(n128, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_128_get_child(n128, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2406,7 +2654,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_256_get_child(n256, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2417,6 +2667,27 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
}
}
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = %lu\n", pg_atomic_read_u64(&tree->ctl->max_val));
+
+ if (!tree->ctl->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, tree->ctl->root), 0, true);
+}
+
void
rt_dump_search(radix_tree *tree, uint64 key)
{
@@ -2425,28 +2696,30 @@ rt_dump_search(radix_tree *tree, uint64 key)
int level = 0;
elog(NOTICE, "-----------------------------------------------------------");
- elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+ elog(NOTICE, "max_val = %lu (0x%lX)",
+ pg_atomic_read_u64(&tree->ctl->max_val),
+ pg_atomic_read_u64(&tree->ctl->max_val));
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > pg_atomic_read_u64(&tree->ctl->max_val))
{
elog(NOTICE, "key %lu (0x%lX) is larger than max val",
key, key);
return;
}
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
rt_pointer child;
- rt_dump_node(node, level, false);
+ rt_dump_node(tree, node, level, false);
if (NODE_IS_LEAF(node))
{
@@ -2461,33 +2734,9 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
level++;
}
}
-
-void
-rt_dump(radix_tree *tree)
-{
- rt_node_ptr root;
-
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
- fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_size,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].leaf_size,
- rt_node_kind_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = %lu\n", tree->max_val);
-
- if (!RTPointerIsValid(tree->root))
- {
- fprintf(stderr, "empty tree\n");
- return;
- }
-
- root = rt_node_ptr_encoded(tree->root);
- rt_dump_node(root, 0, true);
-}
#endif
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 82376fde2d..ad169882af 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..68a11df970 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -14,18 +14,24 @@
#define RADIXTREE_H
#include "postgres.h"
+#include "utils/dsa.h"
#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
+typedef dsa_pointer rt_handle;
-extern radix_tree *rt_create(MemoryContext ctx);
+extern radix_tree *rt_create(MemoryContext ctx, dsa_area *dsa);
extern void rt_free(radix_tree *tree);
extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern rt_handle rt_get_handle(radix_tree *tree);
+extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+extern void rt_detach(radix_tree *tree);
+
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
extern bool rt_delete(radix_tree *tree, uint64 key);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 405606fe2f..dad06adecc 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index cc6970c87c..a0ff1e1c77 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -5,21 +5,38 @@ CREATE EXTENSION test_radixtree;
--
SELECT test_radixtree();
NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "16"
NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
NOTICE: testing radix tree with pattern "all ones"
NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
NOTICE: testing radix tree with pattern "single values, distance > 2^32"
NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
test_radixtree
----------------
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index cb3596755d..a948cba4ec 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -19,6 +19,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -111,7 +112,7 @@ test_empty(void)
radix_tree *radixtree;
uint64 dummy;
- radixtree = rt_create(CurrentMemoryContext);
+ radixtree = rt_create(CurrentMemoryContext, NULL);
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -217,14 +218,10 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
* level.
*/
static void
-test_node_types(uint8 shift)
+do_test_node_types(radix_tree *radixtree, uint8 shift)
{
- radix_tree *radixtree;
-
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
- radixtree = rt_create(CurrentMemoryContext);
-
/*
* Insert and search entries for every node type at the 'shift' level,
* then delete all entries to make it empty, and insert and search entries
@@ -233,19 +230,39 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift);
test_node_types_delete(radixtree, shift);
test_node_types_insert(radixtree, shift);
+}
- rt_free(radixtree);
+static void
+test_node_types(void)
+{
+ int tranche_id = LWLockNewTrancheId();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ {
+ radix_tree *tree;
+ dsa_area *dsa;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ dsa = dsa_create(tranche_id);
+ tree = rt_create(CurrentMemoryContext, dsa);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+ dsa_detach(dsa);
+ }
}
/*
* Test with a repeating pattern, defined by the 'spec'.
*/
static void
-test_pattern(const test_spec * spec)
+do_test_pattern(radix_tree *radixtree, const test_spec * spec)
{
- radix_tree *radixtree;
rt_iter *iter;
- MemoryContext radixtree_ctx;
TimestampTz starttime;
TimestampTz endtime;
uint64 n;
@@ -271,18 +288,6 @@ test_pattern(const test_spec * spec)
pattern_values[pattern_num_values++] = i;
}
- /*
- * Allocate the radix tree.
- *
- * Allocate it in a separate memory context, so that we can print its
- * memory usage easily.
- */
- radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
- "radixtree test",
- ALLOCSET_SMALL_SIZES);
- MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
- radixtree = rt_create(radixtree_ctx);
-
/*
* Add values to the set.
*/
@@ -336,8 +341,6 @@ test_pattern(const test_spec * spec)
mem_usage = rt_memory_usage(radixtree);
fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
mem_usage, (double) mem_usage / spec->num_values);
-
- MemoryContextStats(radixtree_ctx);
}
/* Check that rt_num_entries works */
@@ -484,21 +487,54 @@ test_pattern(const test_spec * spec)
if ((nbefore - ndeleted) != nafter)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+}
+
+static void
+test_patterns(void)
+{
+ int tranche_id = LWLockNewTrancheId();
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ {
+ radix_tree *tree;
+ MemoryContext radixtree_ctx;
+ dsa_area *dsa;
+ const test_spec *spec = &test_specs[i];
- MemoryContextDelete(radixtree_ctx);
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+ /* Test the local radix tree */
+ tree = rt_create(radixtree_ctx, NULL);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextReset(radixtree_ctx);
+
+ /* Test the shared radix tree */
+ dsa = dsa_create(tranche_id);
+ tree = rt_create(radixtree_ctx, dsa);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ dsa_detach(dsa);
+ MemoryContextDelete(radixtree_ctx);
+ }
}
Datum
test_radixtree(PG_FUNCTION_ARGS)
{
test_empty();
-
- for (int shift = 0; shift <= (64 - 8); shift += 8)
- test_node_types(shift);
-
- /* Test different test patterns, with lots of entries */
- for (int i = 0; i < lengthof(test_specs); i++)
- test_pattern(&test_specs[i]);
+ test_node_types();
+ test_patterns();
PG_RETURN_VOID();
}
--
2.31.1
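To illustrate the shared-memory API added by the patch above, here is a minimal usage sketch (not part of the patch). tranche_id, key, and value are placeholders, error handling is omitted, and an attaching backend would normally obtain its dsa_area pointer via dsa_attach():

    /* Backend A: create the shared radix tree and publish its handle */
    dsa_area   *dsa = dsa_create(tranche_id);
    radix_tree *tree = rt_create(CurrentMemoryContext, dsa);
    rt_handle   handle = rt_get_handle(tree);  /* pass to other backends */

    rt_set(tree, key, value);

    /* Backend B: attach to the existing tree using the handle */
    radix_tree *atree = rt_attach(dsa, handle);
    uint64      val;

    if (rt_search(atree, key, &val))
    {
        /* found the value stored for key */
    }

    rt_detach(atree);  /* frees only the backend-local object */

    /* Backend A: free all nodes and the control object in the DSA area */
    rt_free(tree);
    dsa_detach(dsa);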
v10-0002-Add-radix-implementation.patch
From f6cd9570460e9ae2a53e670c94bdee0c69b883b2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v10 2/7] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2404 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 28 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 504 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3069 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..6159b73b75
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2404 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression and lazy path expansion. The radix
+ * tree supports a fixed key length, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes, with
+ * shift > 0, store pointers to their child nodes as values, whereas leaf nodes,
+ * with shift == 0, store the 64-bit unsigned integer specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is some duplicated code. While this sometimes makes code
+ * maintenance tricky, it reduces branch prediction misses when judging
+ * whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iterate - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * along with memory contexts for each kind of radix tree node under it.
+ *
+ * rt_iterate_next() returns key-value pairs in ascending order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes needed for a bitmap covering nslots slots.
+ * Used for the is-set bitmaps in node-128 and node-256 leaves.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* The maximum number of levels in the radix tree */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-128 */
+#define RT_NODE_128_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
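+/* (e.g., RT_GET_KEY_CHUNK(0x0102030405060708, 8) is 0x07) */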
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from a slot number to the byte and bit in the is-set bitmap.
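+ * For example, slot 130 maps to isset byte 16, bit 2.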
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search_inner() and rt_node_search_leaf() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation
+ * smaller classes should optimally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 40/40 -> 296/286 -> 1288/1304 -> 2056/2088 bytes for inner nodes and
+ * leaf nodes, respectively, leading to a large amount of allocator padding
+ * with aset.c. Hence the use of slab.
+ *
+ * XXX: do we need a node-1 as long as there is no path compression optimization?
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_128 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/* Common type for all node types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size kind of the node */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+
+/* Base types for each node kind, shared by leaf and inner nodes */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-128 uses a slot_idxs array, an array of RT_NODE_MAX_SLOTS (256) entries,
+ * to store indexes into a second array that contains up to 128 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base128
+{
+ rt_node n;
+
+ /* Map from key chunk to slot index; RT_NODE_128_INVALID_IDX if unused */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_128;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * Leaf nodes have size classes separate from the inner nodes for two main reasons:
+ *
+ * 1) the value type might be different from something fitting into a pointer-
+ * width type
+ * 2) we need to represent non-existing values in a key-type-independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * important. It might be better to just indicate non-existing entries the
+ * same way as in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* 4 children, for key chunks */
+ rt_node *children[4];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* 4 values, for key chunks */
+ uint64 values[4];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* 32 children, for key chunks */
+ rt_node *children[32];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* 32 values, for key chunks */
+ uint64 values[32];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 children */
+ rt_node *children[128];
+} rt_node_inner_128;
+
+typedef struct rt_node_leaf_128
+{
+ rt_node_base_128 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* Slots for 128 values */
+ uint64 values[128];
+} rt_node_leaf_128;
+
+/*
+ * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each node kind */
+typedef struct rt_node_kind_info_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_node_kind_info_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
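+ * For example, with the default 8kB slab block and the 296-byte inner
+ * node-32, this evaluates to Max(27 * 296, 32 * 296) = 9472 bytes.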
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
+static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
+
+ [RT_NODE_KIND_4] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4),
+ .leaf_size = sizeof(rt_node_leaf_4),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
+ },
+ [RT_NODE_KIND_32] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32),
+ .leaf_size = sizeof(rt_node_leaf_32),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
+ },
+ [RT_NODE_KIND_128] = {
+ .name = "radix tree node 128",
+ .fanout = 128,
+ .inner_size = sizeof(rt_node_inner_128),
+ .leaf_size = sizeof(rt_node_leaf_128),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
+ },
+ [RT_NODE_KIND_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_NODE_KIND_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
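+ /*
+ * Compare 'chunk' against all 32 stored chunks at once: two vector
+ * equality comparisons produce a bitfield with one bit per slot, which
+ * is masked down to the slots actually in use before picking the first
+ * match.
+ */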
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64 *) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_values, src_values, sizeof(uint64) * count);
+}
+
+/* Functions to manipulate inner and leaf node-128 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_128 *) node)->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+static void
+node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Return an unused slot in node-128 */
+static int
+node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
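+ * For example, key_get_shift(0xFF) is 0 and key_get_shift(0x100) is 8.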
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key value that can be stored in a tree whose root
+ * node has the given shift.
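+ * For example, shift_get_max_val(8) is 0xFFFF.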
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ rt_node *node;
+
+ node = (rt_node *) rt_alloc_node(tree, RT_NODE_KIND_4, shift, 0,
+ shift > 0);
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = node;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, int kind, uint8 shift, uint8 chunk, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+ newnode->kind = kind;
+ newnode->shift = shift;
+ newnode->chunk = chunk;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) newnode;
+
+ memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+ }
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[kind]++;
+#endif
+
+ return newnode;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node *
+rt_copy_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode->count = node->count;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ tree->root = NULL;
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[node->kind]--;
+ Assert(tree->cnt[node->kind] >= 0);
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_NODE_KIND_4,
+ shift, 0, true);
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+
+ newchild = rt_alloc_node(tree, RT_NODE_KIND_4, newshift,
+ RT_GET_KEY_CHUNK(key, node->shift),
+ newshift > 0);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is returned in '*child_p'.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_128_get_child(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the value
+ * is returned in '*value_p'.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_128_get_value(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
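+ /*
+ * Each case below inserts into the node if it has a free slot. If the
+ * node is full, it is grown into the next larger kind, replaced in the
+ * parent, and control falls through to the next case to perform the
+ * actual insertion into the new node.
+ */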
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children,
+ n4->base.n.count);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_inner_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_inner_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_128_update(n128, chunk, child);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_inner_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_inner_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_128_get_child(n128, i));
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and child have been inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value into the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_copy_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values,
+ n4->base.n.count);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_leaf_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_leaf_128 *) rt_copy_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_128_update(n128, chunk, value);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_leaf_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_copy_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_128_get_value(n128, i));
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_128_insert(n128, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value have been inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true; otherwise insert a new entry and return false.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set to *value_p, which
+ * therefore must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ /*
+ * If we eventually deleted the root node while recursively deleting empty
+ * nodes, we make the tree empty.
+ */
+ if (level == 0)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is constructed
+ * while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from the level=1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance to the next used slot in the inner node. Return the child if one
+ * exists, otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_128_get_child(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance to the next used slot in the leaf node. On success, return true
+ * and set the value to *value_p; otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_128_get_value(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) node,
+ n128->slot_idxs[i]));
+ else
+ Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) node,
+ n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check that the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0],
+ tree->cnt[1],
+ tree->cnt[2],
+ tree->cnt[3])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[128] = {0};
+
+ fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *b128 = (rt_node_base_128 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b128->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_128_get_value(n128, i));
+ }
+ else
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_128_get_child(n128, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x%lX\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = %lu (0x%lX)", tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key %lu (0x%lX) is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached at a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size%lu\tinner_blocksize %lu\tleaf_size %lu\tleaf_blocksize %lu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = %lu\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 96addded81..11d0ec5b07 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -27,6 +27,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1d26544854..568823b221 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -21,6 +21,7 @@ subdir('test_oat_hooks')
subdir('test_parser')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..cc6970c87c
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,28 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..cb3596755d
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,504 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/* The maximum number of entries each node type can have */
+static int rt_node_max_entries[] = {
+ 4, /* RT_NODE_KIND_4 */
+ 16, /* RT_NODE_KIND_16 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+};
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 10000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ uint64 dummy;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", key);
+
+ for (int j = 0; j < lengthof(rt_node_max_entries); j++)
+ {
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (i == (rt_node_max_entries[j] - 1))
+ {
+ check_search_on_node(radixtree, shift,
+ (j == 0) ? 0 : rt_node_max_entries[j - 1],
+ rt_node_max_entries[j]);
+ break;
+ }
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "inserted key 0x" UINT64_HEX_FORMAT " is not found", key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
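As a reading aid for the radixtree.h interface added above, here is a minimal
usage sketch. It is not part of the patch; it only uses the functions declared
in the header and assumes a caller running in a suitable memory context
(CurrentMemoryContext here):

    radix_tree *tree = rt_create(CurrentMemoryContext);
    rt_iter    *iter;
    uint64      key;
    uint64      val;

    /* rt_set() returns false for a brand-new key, true when it overwrites */
    if (rt_set(tree, 42, 4200))
        elog(ERROR, "key 42 unexpectedly existed");

    if (rt_search(tree, 42, &val))
        elog(NOTICE, "found value " UINT64_FORMAT, val);

    /* walk all key-value pairs */
    iter = rt_begin_iterate(tree);
    while (rt_iterate_next(iter, &key, &val))
        elog(NOTICE, UINT64_FORMAT " -> " UINT64_FORMAT, key, val);
    rt_end_iterate(iter);

    rt_delete(tree, 42);
    rt_free(tree);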
Attachment: v10-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From 9fd128f027302de19075942180b749ebd184007b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v10 1/7] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
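For context on how these helpers can be put to work in the node search paths
discussed in this thread, here is a rough sketch (not taken from any attached
patch; the function name and the padded-array assumption are mine) of finding
a chunk's position with vector8_eq() plus the new vector8_highbit_mask(). It
requires a SIMD build, since vector8_eq() is not provided under USE_NO_SIMD,
and it assumes the chunk array is padded out to a multiple of sizeof(Vector8):

    /* returns the index of 'chunk' among the first 'count' slots, or -1 */
    static inline int
    example_chunk_search_eq(const uint8 *chunks, int count, uint8 chunk)
    {
        Vector8     spread = vector8_broadcast(chunk);

        for (int i = 0; i < count; i += sizeof(Vector8))
        {
            Vector8     haystack;
            uint32      bitfield;

            vector8_load(&haystack, &chunks[i]);
            bitfield = vector8_highbit_mask(vector8_eq(spread, haystack));

            /* in the final, partial vector, ignore matches beyond 'count' */
            if (count - i < (int) sizeof(Vector8))
                bitfield &= ((uint32) 1 << (count - i)) - 1;
            if (bitfield)
                return i + pg_rightmost_one_pos32(bitfield);
        }

        return -1;
    }

(pg_rightmost_one_pos32() is from port/pg_bitutils.h.)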
On Mon, Nov 21, 2022 at 4:20 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Nov 18, 2022 at 2:48 PM I wrote:
One issue with this patch: The "fanout" member is a uint8, so it can't hold 256 for the largest node kind. That's not an issue in practice, since we never need to grow it, and we only compare that value with the count in an Assert(), so I just set it to zero. That does break an invariant, so it's not great. We could use 2 bytes to be strictly correct in all cases, but that limits what we can do with the smallest node kind.
Thinking about this part, there's an easy resolution -- use a different macro for fixed- and variable-sized node kinds to determine if there is a free slot.
Also, I wanted to share some results of adjusting the boundary between the two smallest node kinds. In the hackish attached patch, I modified the fixed height search benchmark to search a small (within L1 cache) tree thousands of times. For the first set I modified node4's maximum fanout and filled it up. For the second, I set node4's fanout to 1, which causes 2+ to spill to node32 (actually the partially-filled node15 size class as demoed earlier).
node4:
NOTICE: num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
2 | 16 | 16520 | 0 | 3

NOTICE: num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
3 | 81 | 16456 | 0 | 17

NOTICE: num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
4 | 256 | 16456 | 0 | 89

NOTICE: num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
5 | 625 | 16488 | 0 | 327

node32:
NOTICE: num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
2 | 16 | 16488 | 0 | 5
(1 row)

NOTICE: num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
3 | 81 | 16520 | 0 | 28

NOTICE: num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
4 | 256 | 16408 | 0 | 79

NOTICE: num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
5 | 625 | 24616 | 0 | 199

In this test, node32 seems slightly faster than node4 with 4 elements, at the cost of more memory.
Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), 3 has a nice property: no wasted padding space:
node4: 5 + 4+(7) + 4*8 = 48 bytes
node3: 5 + 3 + 3*8 = 32
IIUC if we store the fanout member only in variable-sized nodes,
rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
size doesn't change but there is 1 byte padding space.
Also, even if we have the node3 a variable-sized node, size class 1
for node3 could be a good choice since it also doesn't need padding
space and could be a good alternative to path compression.
node3 : 5 + 3 + 3*8 = 32 bytes
size class 1 : 5 + 3 + 1*8 = 16 bytes
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
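To make the size arithmetic in the last two messages concrete, here is an
illustrative layout for a fixed-size node3 (the field names and the exact
composition of the 5-byte header are assumptions for illustration, not
definitions taken from the patches):

    typedef struct rt_node_inner_3
    {
        /* assumed 5-byte common header: count, shift, chunk, fanout */
        uint16      count;
        uint8       shift;
        uint8       chunk;
        uint8       fanout;

        uint8       chunks[3];          /* 5 + 3 = 8 bytes, so ... */
        struct rt_node *children[3];    /* ... the pointers start aligned */
    } rt_node_inner_3;                  /* 32 bytes on a 64-bit platform */

With 4 chunk slots instead of 3, the compiler would need 7 bytes of padding
before children[], which is where the 48-byte figure for node4 comes from.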
On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Nov 21, 2022 at 4:20 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Assuming the smallest node is fixed size (i.e. fanout/capacity member
not part of the common set, so only part of variable-sized nodes), 3 has a
nice property: no wasted padding space:
node4: 5 + 4+(7) + 4*8 = 48 bytes
node3: 5 + 3 + 3*8 = 32

IIUC if we store the fanout member only in variable-sized nodes,
rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
size doesn't change but there is 1 byte padding space.
I forgot to mention I'm assuming no pointer-tagging for this exercise.
You've demonstrated it can be done in a small amount of code, and I hope we
can demonstrate a speedup in search. Just in case there is some issue with
portability, valgrind, or some other obstacle, I'm being pessimistic in my
calculations.
Also, even if we have the node3 a variable-sized node, size class 1
for node3 could be a good choice since it also doesn't need padding
space and could be a good alternative to path compression.
node3 : 5 + 3 + 3*8 = 32 bytes
size class 1 : 5 + 3 + 1*8 = 16 bytes
Precisely! I have that scenario in my notes as well -- it's quite
compelling.
--
John Naylor
EDB: http://www.enterprisedb.com
On 2022-11-21 17:06:56 +0900, Masahiko Sawada wrote:
Sure. I've attached the v10 patches. 0004 is the pure refactoring
patch and 0005 patch introduces the pointer tagging.
This failed on cfbot, with so many crashes that the VM ran out of disk for
core dumps. It happened during testing with 32-bit builds, so there's probably
something broken around that.
https://cirrus-ci.com/task/4635135954386944
A failure is e.g. at: https://api.cirrus-ci.com/v1/artifact/task/4635135954386944/testrun/build-32/testrun/adminpack/regress/log/initdb.log
performing post-bootstrap initialization ... ../src/backend/lib/radixtree.c:1696:21: runtime error: member access within misaligned address 0x590faf74 for type 'struct radix_tree_control', which requires 8 byte alignment
0x590faf74: note: pointer points here
90 11 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
==55813==Using libbacktrace symbolizer.
#0 0x56dcc274 in rt_create ../src/backend/lib/radixtree.c:1696
#1 0x56953d1b in tidstore_create ../src/backend/access/common/tidstore.c:57
#2 0x56a1ca4f in dead_items_alloc ../src/backend/access/heap/vacuumlazy.c:3109
#3 0x56a2219f in heap_vacuum_rel ../src/backend/access/heap/vacuumlazy.c:539
#4 0x56cb77ed in table_relation_vacuum ../src/include/access/tableam.h:1681
#5 0x56cb77ed in vacuum_rel ../src/backend/commands/vacuum.c:2062
#6 0x56cb9a16 in vacuum ../src/backend/commands/vacuum.c:472
#7 0x56cba904 in ExecVacuum ../src/backend/commands/vacuum.c:272
#8 0x5711b6d0 in standard_ProcessUtility ../src/backend/tcop/utility.c:866
#9 0x5711bdeb in ProcessUtility ../src/backend/tcop/utility.c:530
#10 0x5711759f in PortalRunUtility ../src/backend/tcop/pquery.c:1158
#11 0x57117cb8 in PortalRunMulti ../src/backend/tcop/pquery.c:1315
#12 0x571183d2 in PortalRun ../src/backend/tcop/pquery.c:791
#13 0x57111049 in exec_simple_query ../src/backend/tcop/postgres.c:1238
#14 0x57113f9c in PostgresMain ../src/backend/tcop/postgres.c:4551
#15 0x5711463d in PostgresSingleUserMain ../src/backend/tcop/postgres.c:4028
#16 0x56df4672 in main ../src/backend/main/main.c:197
#17 0xf6ad8e45 in __libc_start_main (/lib/i386-linux-gnu/libc.so.6+0x1ae45)
#18 0x5691d0f0 in _start (/tmp/cirrus-ci-build/build-32/tmp_install/usr/local/pgsql/bin/postgres+0x3040f0)
Aborted (core dumped)
child process exited with exit code 134
initdb: data directory "/tmp/cirrus-ci-build/build-32/testrun/adminpack/regress/tmp_check/data" not removed at user's request
On Mon, Nov 21, 2022 at 6:30 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Nov 21, 2022 at 4:20 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), 3 has a nice property: no wasted padding space:
node4: 5 + 4+(7) + 4*8 = 48 bytes
node3: 5 + 3 + 3*8 = 32

IIUC if we store the fanout member only in variable-sized nodes,
rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
size doesn't change but there is 1 byte padding space.

I forgot to mention I'm assuming no pointer-tagging for this exercise. You've demonstrated it can be done in a small amount of code, and I hope we can demonstrate a speedup in search. Just in case there is some issue with portability, valgrind, or some other obstacle, I'm being pessimistic in my calculations.
Also, even if we have the node3 a variable-sized node, size class 1
for node3 could be a good choice since it also doesn't need padding
space and could be a good alternative to path compression.
node3 : 5 + 3 + 3*8 = 32 bytes
size class 1 : 5 + 3 + 1*8 = 16 bytes

Precisely! I have that scenario in my notes as well -- it's quite compelling.
So it seems that there are two candidates for the rt_node structure: (1)
all nodes except for node256 are variable-size nodes and use pointer
tagging, and (2) node32 and node128 are variable-sized nodes and do
not use pointer tagging (the fanout member is part of only these two
nodes). rt_node can be 5 bytes in both cases. But before going that
far, I started to verify the idea of variable-size nodes by using a
6-byte rt_node. We can adjust the node kinds and node classes later.
In this verification, all nodes except for node256 are variable-sized
nodes, and the sizes are:
radix tree node 1 : 6 + 4 + (6) + 1*8 = 24 bytes
radix tree node 4 : 6 + 4 + (6) + 4*8 = 48
radix tree node 15 : 6 + 32 + (2) + 15*8 = 160
radix tree node 32 : 6 + 32 + (2) + 32*8 = 296
radix tree node 61 : inner 6 + 256 + (2) + 61*8 = 752, leaf 6 + 256 + (2) + 16 + 61*8 = 768
radix tree node 128 : inner 6 + 256 + (2) + 128*8 = 1288, leaf 6 + 256 + (2) + 16 + 128*8 = 1304
radix tree node 256 : inner 6 + (2) + 256*8 = 2056, leaf 6 + (2) + 32 + 256*8 = 2088
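To make the arithmetic above concrete, a smaller size class simply
reuses the struct of its kind and is allocated short; only the trailing
children/values array shrinks, so search and insert code keyed on the
kind works unchanged. An illustrative sketch on top of the
rt_node_inner_32 definition from the attached radixtree.c (the macro
names here are hypothetical, not taken from the patches):

/*
 * "radix tree node 15" = the node-32 kind with room for only 15 children:
 * 6 (header) + 32 (chunks) + 2 (padding) + 15 * 8 (children) = 160 bytes.
 */
#define RT_CLASS_32_PARTIAL_FANOUT  15

#define RT_CLASS_32_PARTIAL_INNER_SIZE \
    (offsetof(rt_node_inner_32, children) + \
     RT_CLASS_32_PARTIAL_FANOUT * sizeof(rt_node *))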
I did some performance tests against two radix trees: a radix tree
supporting only fixed-size nodes (i.e. applying patches up to 0003),
and a radix tree supporting variable-size nodes (i.e. applying all the
attached patches). Also, I changed the bench_search_random_nodes()
function so that the filter can be specified via a function argument.
Here are the results:
* Query
select * from bench_seq_search(0, 1*1000*1000, false)
* Fixed-size
NOTICE: num_keys = 1000000, height = 2, n4 = 0, n32 = 31251, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871216 |                     |         67 |               |          212 |
(1 row)
* Variable-size
NOTICE: num_keys = 1000000, height = 2, n1 = 0, n4 = 0, n15 = 0, n32 = 31251, n61 = 0, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 1000000 |          9871280 |                     |         74 |               |          212 |
(1 row)
---
* Query
select * from bench_seq_search(0, 2*1000*1000, true)
NOTICE: num_keys = 999654, height = 2, n4 = 1, n32 = 62499, n128 = 1, n256 = 245
* Fixed-size
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         19680848 |                     |         74 |               |          201 |
(1 row)
* Variable-size
NOTICE: num_keys = 999654, height = 2, n1 = 0, n4 = 1, n15 = 26951, n32 = 35548, n61 = 1, n128 = 0, n256 = 245
 nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
--------+------------------+---------------------+------------+---------------+--------------+-----------------
 999654 |         16009040 |                     |         85 |               |          201 |
(1 row)
---
* Query
select * from bench_search_random_nodes(10 * 1000 * 1000, '0x7F07FF00FF')
* Fixed-size
NOTICE: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024
mem_allocated | search_ms
---------------+-----------
343001456 | 1151
(1 row)
* Variable-size
NOTICE: num_keys = 9291812, height = 4, n1 = 262144, n4 = 0, n15 = 138, n32 = 79465, n61 = 182665, n128 = 5, n256 = 1024
mem_allocated | search_ms
---------------+-----------
230504328 | 1077
(1 row)
---
* Query
select * from bench_search_random_nodes(10 * 1000 * 1000, '0xFFFF0000003F')
* Fixed-size
NOTICE: num_keys = 3807650, height = 5, n4 = 196608, n32 = 0, n128 = 65536, n256 = 257
mem_allocated | search_ms
---------------+-----------
99911920 | 632
(1 row)
* Variable-size
NOTICE: num_keys = 3807650, height = 5, n1 = 196608, n4 = 0, n15 = 0, n32 = 0, n61 = 61747, n128 = 3789, n256 = 257
mem_allocated | search_ms
---------------+-----------
64045688 | 554
(1 row)
Overall, the idea of variable-sized nodes looks good: a smaller memory
footprint without losing search performance. I'm going to check the
load performance as well.
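For reference, growing within a kind is then only a matter of a larger
allocation and a copy, since the kind (and therefore every search/insert
code path) stays the same. A rough sketch, not taken from the attached
patches (rt_grow_node_class is a hypothetical name):

static rt_node *
rt_grow_node_class(radix_tree *tree, rt_node *node,
                   rt_size_class old_class, rt_size_class new_class)
{
    bool        inner = !NODE_IS_LEAF(node);
    Size        old_size = inner ? rt_size_class_info[old_class].inner_size
                                 : rt_size_class_info[old_class].leaf_size;
    rt_node    *newnode = rt_alloc_node(tree, new_class, inner);

    /* same kind, so the used prefix of the layout is identical */
    memcpy(newnode, node, old_size);
    newnode->fanout = rt_size_class_info[new_class].fanout;

    /* the caller replaces the pointer in the parent and frees the old node */
    return newnode;
}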
I've attached the patches I used for the verification. I don't include
patches for pointer tagging, DSA support, and vacuum integration since
I'm investigating the issue on cfbot that Andres reported. Also, I've
modified tests to improve the test coverage.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v11-0004-Preparatory-refactoring-for-decoupling-kind-from.patch
From 9b8d423d8a1969b698dcd07bbfd1e309e86bddd2 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 17 Nov 2022 12:10:31 +0700
Subject: [PATCH v11 4/6] Preparatory refactoring for decoupling kind from size
class
Rename the current kind info array to refer to size classes, but
keep all the contents the same.
Add a fanout member to all nodes which stores the max capacity of
the node. This is currently set with the same hardcoded value as
in the kind info array.
In passing, remove outdated reference to node16 in the regression
test.
---
src/backend/lib/radixtree.c | 196 +++++++++++++++++++++---------------
1 file changed, 117 insertions(+), 79 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index cc1a629fed..b71545e031 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -131,6 +131,16 @@ typedef enum
#define RT_NODE_KIND_256 0x03
#define RT_NODE_KIND_COUNT 4
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_FULL,
+ RT_CLASS_128_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
/* Common type for all nodes types */
typedef struct rt_node
{
@@ -140,6 +150,9 @@ typedef struct rt_node
*/
uint16 count;
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ uint8 fanout;
+
/*
* Shift indicates which part of the key space is represented by this
* node. That is, the key is shifted by 'shift' and the lowest
@@ -148,13 +161,13 @@ typedef struct rt_node
uint8 shift;
uint8 chunk;
- /* Size kind of the node */
+ /* Node kind, one per search/set algorithm */
uint8 kind;
} rt_node;
#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
-#define NODE_HAS_FREE_SLOT(n) \
- (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+#define NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
/* Base type of each node kinds for leaf and inner nodes */
typedef struct rt_node_base_4
@@ -194,7 +207,7 @@ typedef struct rt_node_base256
/*
* Inner and leaf nodes.
*
- * There are separate from inner node size classes for two main reasons:
+ * These are separate for two main reasons:
*
* 1) the value type might be different than something fitting into a pointer
* width type
@@ -278,8 +291,8 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
-/* Information of each size kinds */
-typedef struct rt_node_kind_info_elem
+/* Information for each size class */
+typedef struct rt_size_class_elem
{
const char *name;
int fanout;
@@ -291,7 +304,7 @@ typedef struct rt_node_kind_info_elem
/* slab block size */
Size inner_blocksize;
Size leaf_blocksize;
-} rt_node_kind_info_elem;
+} rt_size_class_elem;
/*
* Calculate the slab blocksize so that we can allocate at least 32 chunks
@@ -299,9 +312,9 @@ typedef struct rt_node_kind_info_elem
*/
#define NODE_SLAB_BLOCK_SIZE(size) \
Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
-static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
- [RT_NODE_KIND_4] = {
+ [RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
.inner_size = sizeof(rt_node_inner_4),
@@ -309,7 +322,7 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
},
- [RT_NODE_KIND_32] = {
+ [RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(rt_node_inner_32),
@@ -317,7 +330,7 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
},
- [RT_NODE_KIND_128] = {
+ [RT_CLASS_128_FULL] = {
.name = "radix tree node 128",
.fanout = 128,
.inner_size = sizeof(rt_node_inner_128),
@@ -325,9 +338,11 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
},
- [RT_NODE_KIND_256] = {
+ [RT_CLASS_256] = {
.name = "radix tree node 256",
- .fanout = 256,
+ /* technically it's 256, but we can't store that in a uint8,
+ and this is the max size class so it will never grow */
+ .fanout = 0,
.inner_size = sizeof(rt_node_inner_256),
.leaf_size = sizeof(rt_node_leaf_256),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
@@ -335,6 +350,14 @@ static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
},
};
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_FULL,
+ [RT_NODE_KIND_128] = RT_CLASS_128_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
/*
* Iteration support.
*
@@ -376,21 +399,21 @@ struct radix_tree
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
- MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
/* statistics */
#ifdef RT_DEBUG
- int32 cnt[RT_NODE_KIND_COUNT];
+ int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node * rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift,
- uint8 chunk, bool inner);
-static inline void rt_init_node(rt_node *node, uint8 kind, uint8 shift, uint8 chunk,
- bool inner);
-static rt_node *rt_alloc_node(radix_tree *tree, int kind, bool inner);
+static rt_node * rt_alloc_init_node(radix_tree *tree, uint8 kind, rt_size_class size_class,
+ uint8 shift, uint8 chunk, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, uint8 shift,
+ uint8 chunk, bool inner);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
static void rt_free_node(radix_tree *tree, rt_node *node);
static void rt_extend(radix_tree *tree, uint64 key);
static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
@@ -591,7 +614,7 @@ chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
uint8 *dst_chunks, rt_node **dst_children, int count)
{
/* For better code generation */
- if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ if (count > rt_size_class_info[RT_CLASS_4_FULL].fanout)
pg_unreachable();
memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
@@ -603,7 +626,7 @@ chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
uint8 *dst_chunks, uint64 *dst_values, int count)
{
/* For better code generation */
- if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ if (count > rt_size_class_info[RT_CLASS_4_FULL].fanout)
pg_unreachable();
memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
@@ -844,20 +867,21 @@ rt_new_root(radix_tree *tree, uint64 key)
int shift = key_get_shift(key);
rt_node *node;
- node = (rt_node *) rt_alloc_init_node(tree, RT_NODE_KIND_4, shift, 0,
- shift > 0);
+ node = (rt_node *) rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL,
+ shift, 0, shift > 0);
tree->max_val = shift_get_max_val(shift);
tree->root = node;
}
/* Return a new and initialized node */
static rt_node *
-rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk, bool inner)
+rt_alloc_init_node(radix_tree *tree, uint8 kind, rt_size_class size_class, uint8 shift,
+ uint8 chunk, bool inner)
{
rt_node *newnode;
- newnode = rt_alloc_node(tree, kind, inner);
- rt_init_node(newnode, kind, shift, chunk, inner);
+ newnode = rt_alloc_node(tree, size_class, inner);
+ rt_init_node(newnode, kind, size_class, shift, chunk, inner);
return newnode;
}
@@ -866,20 +890,20 @@ rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk, bool
* Allocate a new node with the given node kind.
*/
static rt_node *
-rt_alloc_node(radix_tree *tree, int kind, bool inner)
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
rt_node *newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
- rt_node_kind_info[kind].inner_size);
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
else
- newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
- rt_node_kind_info[kind].leaf_size);
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[kind]++;
+ tree->cnt[size_class]++;
#endif
return newnode;
@@ -887,14 +911,16 @@ rt_alloc_node(radix_tree *tree, int kind, bool inner)
/* Initialize the node contents */
static inline void
-rt_init_node(rt_node *node, uint8 kind, uint8 shift, uint8 chunk, bool inner)
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, uint8 shift, uint8 chunk,
+ bool inner)
{
if (inner)
- MemSet(node, 0, rt_node_kind_info[kind].inner_size);
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
else
- MemSet(node, 0, rt_node_kind_info[kind].leaf_size);
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
node->shift = shift;
node->chunk = chunk;
node->count = 0;
@@ -912,13 +938,13 @@ rt_init_node(rt_node *node, uint8 kind, uint8 shift, uint8 chunk, bool inner)
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node *
-rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
+static rt_node*
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
{
- rt_node *newnode;
+ rt_node *newnode;
- newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
- node->shift > 0);
+ newnode = rt_alloc_init_node(tree, new_kind, kind_min_size_class[new_kind],
+ node->shift, node->chunk, !NODE_IS_LEAF(node));
newnode->count = node->count;
return newnode;
@@ -928,6 +954,8 @@ rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
static void
rt_free_node(radix_tree *tree, rt_node *node)
{
+ int i;
+
/* If we're deleting the root node, make the tree empty */
if (tree->root == node)
{
@@ -937,8 +965,14 @@ rt_free_node(radix_tree *tree, rt_node *node)
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[node->kind]--;
- Assert(tree->cnt[node->kind] >= 0);
+ // FIXME
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
#endif
pfree(node);
@@ -987,7 +1021,7 @@ rt_extend(radix_tree *tree, uint64 key)
{
rt_node_inner_4 *node;
- node = (rt_node_inner_4 *) rt_alloc_init_node(tree, RT_NODE_KIND_4,
+ node = (rt_node_inner_4 *) rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL,
shift, 0, true);
node->base.n.count = 1;
node->base.chunks[0] = 0;
@@ -1017,7 +1051,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
rt_node *newchild;
int newshift = shift - RT_NODE_SPAN;
- newchild = rt_alloc_init_node(tree, RT_NODE_KIND_4, newshift,
+ newchild = rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL, newshift,
RT_GET_KEY_CHUNK(key, node->shift),
newshift > 0);
rt_node_insert_inner(tree, parent, node, key, newchild);
@@ -1248,8 +1282,8 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
rt_node_inner_32 *new32;
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_grow_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children,
n4->base.n.count);
@@ -1294,8 +1328,8 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
rt_node_inner_128 *new128;
/* grow node from 32 to 128 */
- new128 = (rt_node_inner_128 *) rt_grow_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new128 = (rt_node_inner_128 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
for (int i = 0; i < n32->base.n.count; i++)
node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
@@ -1337,8 +1371,8 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
rt_node_inner_256 *new256;
/* grow node from 128 to 256 */
- new256 = (rt_node_inner_256 *) rt_grow_node(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
@@ -1365,7 +1399,8 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(n256->base.n.fanout == 0);
+ Assert(chunk_exists || ((rt_node *) n256)->count < RT_NODE_MAX_SLOTS);
node_inner_256_set(n256, chunk, child);
break;
@@ -1416,8 +1451,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
rt_node_leaf_32 *new32;
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values,
n4->base.n.count);
@@ -1462,8 +1497,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
rt_node_leaf_128 *new128;
/* grow node from 32 to 128 */
- new128 = (rt_node_leaf_128 *) rt_grow_node(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
+ new128 = (rt_node_leaf_128 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
@@ -1505,7 +1540,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
rt_node_leaf_256 *new256;
/* grow node from 128 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node(tree, (rt_node *) n128,
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n128,
RT_NODE_KIND_256);
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
{
@@ -1533,7 +1568,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(((rt_node *) n256)->fanout == 0);
+ Assert(chunk_exists || ((rt_node *) n256)->count < 256);
node_leaf_256_set(n256, chunk, value);
break;
@@ -1571,16 +1607,16 @@ rt_create(MemoryContext ctx)
tree->num_keys = 0;
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].inner_size);
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].leaf_blocksize,
- rt_node_kind_info[i].leaf_size);
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
tree->cnt[i] = 0;
#endif
@@ -1597,7 +1633,7 @@ rt_create(MemoryContext ctx)
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
MemoryContextDelete(tree->inner_slabs[i]);
MemoryContextDelete(tree->leaf_slabs[i]);
@@ -2099,7 +2135,7 @@ rt_memory_usage(radix_tree *tree)
{
Size total = sizeof(radix_tree);
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
@@ -2189,10 +2225,10 @@ rt_stats(radix_tree *tree)
ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
tree->num_keys,
tree->root->shift / RT_NODE_SPAN,
- tree->cnt[0],
- tree->cnt[1],
- tree->cnt[2],
- tree->cnt[3])));
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_128_FULL],
+ tree->cnt[RT_CLASS_256])));
}
static void
@@ -2200,11 +2236,12 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
char space[128] = {0};
- fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
(node->kind == RT_NODE_KIND_4) ? 4 :
(node->kind == RT_NODE_KIND_32) ? 32 :
(node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
node->count, node->shift, node->chunk);
if (level > 0)
@@ -2408,13 +2445,14 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
- rt_node_kind_info[i].name,
- rt_node_kind_info[i].inner_size,
- rt_node_kind_info[i].inner_blocksize,
- rt_node_kind_info[i].leaf_size,
- rt_node_kind_info[i].leaf_blocksize);
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
if (!tree->root)
--
2.31.1
v11-0003-tool-for-measuring-radix-tree-performance.patch
From 496f70836c2828ebca4cc025e933ae7355807292 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v11 3/6] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 65 ++
contrib/bench_radix_tree/bench_radix_tree.c | 554 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 675 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..67ba568531
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,65 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..e69be48448
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,554 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation*/
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* for reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ nulls[2] = true;
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ nulls[2] = false;
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
v11-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From 8d2df83bfaf7ec598292fe1e29446b5d02c278a3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v11 1/6] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
v11-0002-Add-radix-implementation.patch
From f1c3bad56571261cc85c6bce596e652a5c028448 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v11 2/6] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2428 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 32 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 582 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3175 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..cc1a629fed
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2428 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports a fixed length of key, so we don't expect the tree level
+ * to be high.
+ *
+ * Both the key and the value are 64-bit unsigned integer. The inner nodes and
+ * the leaf nodes have slightly different structure: for inner tree nodes,
+ * shift > 0, store the pointer to its child node as the value. The leaf nodes,
+ * shift == 0, have the 64-bit unsigned integer that is specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, therefore there is some code duplication. While this sometimes makes
+ * code maintenance tricky, it reduces branch prediction misses when judging
+ * whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * and memory contexts for all kinds of radix tree node under the memory context.
+ *
+ * rt_iterate_next() ensures returning key-value pairs in the ascending
+ * order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes needed for a bitmap covering nslots slots,
+ * used by nodes indexed by array lookup.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-128 */
+#define RT_NODE_128_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from the value to the bit in is-set bitmap in the node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds.
+ *
+ * XXX: These are currently not well chosen. To reduce memory fragmentation,
+ * a smaller class should ideally fit neatly into the next larger class
+ * (except perhaps at the lowest end). Right now it's
+ * 40/40 -> 296/286 -> 1288/1304 -> 2056/2088 bytes for inner nodes and
+ * leaf nodes, respectively, leading to a large amount of allocator padding
+ * with aset.c. Hence the use of slab.
+ *
+ * XXX: need to have node-1 until there is no path compression optimization?
+ *
+ * XXX: need to explain why we choose these node types based on benchmark
+ * results etc.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_128 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/* Common type for all nodes types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Size kind of the node */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define NODE_HAS_FREE_SLOT(n) \
+ (((rt_node *) (n))->count < rt_node_kind_info[((rt_node *) (n))->kind].fanout)
+
+/* Base type of each node kinds for leaf and inner nodes */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-128 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 128 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base128
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_128;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * There are separate from inner node size classes for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* 4 children, for key chunks */
+ rt_node *children[4];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* 4 values, for key chunks */
+ uint64 values[4];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* 32 children, for key chunks */
+ rt_node *children[32];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* 32 values, for key chunks */
+ uint64 values[32];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_128
+{
+ rt_node_base_128 base;
+
+ /* Slots for 128 children */
+ rt_node *children[128];
+} rt_node_inner_128;
+
+typedef struct rt_node_leaf_128
+{
+ rt_node_base_128 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* Slots for 128 values */
+ uint64 values[128];
+} rt_node_leaf_128;
+
+/*
+ * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information of each size kinds */
+typedef struct rt_node_kind_info_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_node_kind_info_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
+static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {
+
+ [RT_NODE_KIND_4] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4),
+ .leaf_size = sizeof(rt_node_leaf_4),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
+ },
+ [RT_NODE_KIND_32] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32),
+ .leaf_size = sizeof(rt_node_leaf_32),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
+ },
+ [RT_NODE_KIND_128] = {
+ .name = "radix tree node 128",
+ .fanout = 128,
+ .inner_size = sizeof(rt_node_inner_128),
+ .leaf_size = sizeof(rt_node_leaf_128),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
+ },
+ [RT_NODE_KIND_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_NODE_KIND_COUNT];
+ MemoryContextData *leaf_slabs[RT_NODE_KIND_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_NODE_KIND_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node * rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift,
+ uint8 chunk, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, uint8 shift, uint8 chunk,
+ bool inner);
+static rt_node *rt_alloc_node(radix_tree *tree, int kind, bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'.
+ * Return -1 if there is no such chunk.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'.
+ * Return -1 if there is no such chunk.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
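+
+	/*
+	 * Compare 'chunk' against all 32 chunk slots at once; each equal byte
+	 * lane sets one bit in 'bitfield' below, and lanes beyond 'count' are
+	 * masked off.
+	 */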
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
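+
+	/*
+	 * vector8_min() equals 'chunk' exactly in the lanes whose stored chunk
+	 * is >= 'chunk', so the rightmost (lowest) set bit in 'bitfield' below
+	 * is the first position whose chunk is >= the one being inserted, i.e.
+	 * the insertion point; if no bit is set, the new chunk goes at the end.
+	 */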
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+	memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_values, src_values, sizeof(uint64) * count);
+}
+
+/* Functions to manipulate inner and leaf node-128 */
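+
+/*
+ * In node-128, slot_idxs[] maps a key chunk (0-255) to an index into the
+ * children/values array, with RT_NODE_128_INVALID_IDX marking chunks that
+ * have no entry; the leaf variant additionally tracks used slots in the
+ * 'isset' bitmap.
+ */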
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_128_is_chunk_used(rt_node_base_128 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_128_get_child(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_128 *) node)->slot_idxs[chunk] != RT_NODE_128_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+static void
+node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
+}
+
+/* Return an unused slot in node-128 */
+static int
+node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_128_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_128_find_unused_slot(node, chunk);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_128_update(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed for a node that can store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key that can be stored under a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
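+
+/*
+ * For example, assuming RT_NODE_SPAN is 8: key 0x10000 has its highest set
+ * bit at position 16, so key_get_shift() returns 16, and a root node with
+ * shift 16 can cover keys up to shift_get_max_val(16) = 0xFFFFFF.
+ */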
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ rt_node *node;
+
+ node = (rt_node *) rt_alloc_init_node(tree, RT_NODE_KIND_4, shift, 0,
+ shift > 0);
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = node;
+}
+
+/* Return a new and initialized node */
+static rt_node *
+rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk, bool inner)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, kind, inner);
+ rt_init_node(newnode, kind, shift, chunk, inner);
+
+ return newnode;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, int kind, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[kind]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, uint8 shift, uint8 chunk, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_node_kind_info[kind].inner_size);
+ else
+ MemSet(node, 0, rt_node_kind_info[kind].leaf_size);
+
+ node->kind = kind;
+ node->shift = shift;
+ node->chunk = chunk;
+ node->count = 0;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_128)
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+
+ memset(n128->slot_idxs, RT_NODE_128_INVALID_IDX, sizeof(n128->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count as 'node'.
+ */
+static rt_node *
+rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode->count = node->count;
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[node->kind]--;
+ Assert(tree->cnt[node->kind] >= 0);
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height to store the key. Extend
+ * the radix tree upwards so it can store it.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
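+	/*
+	 * For example, assuming RT_NODE_SPAN is 8, growing a root at shift 8 to
+	 * hold key 0x100000000 (target_shift 32) pushes new node-4 inner nodes
+	 * on top of the old root at shifts 16, 24, and 32.
+	 */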
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_init_node(tree, RT_NODE_KIND_4,
+ shift, 0, true);
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have the inner and leaf nodes needed for the given
+ * key-value pair. Create them, descending from 'node' to the bottom level.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+
+ newchild = rt_alloc_init_node(tree, RT_NODE_KIND_4, newshift,
+ RT_GET_KEY_CHUNK(key, node->shift),
+ newshift > 0);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is returned in *child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_128_get_child(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is returned in *value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_128_get_value(n128, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_128_delete(n128, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_grow_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children,
+ n4->base.n.count);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_inner_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_inner_128 *) rt_grow_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_128_update(n128, chunk, child);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_inner_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_inner_256 *) rt_grow_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_128_get_child(n128, i));
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+	 * Done. Finally, verify that the chunk and child were inserted or
+	 * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values,
+ n4->base.n.count);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_leaf_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_leaf_128 *) rt_grow_node(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node;
+ int cnt = 0;
+
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_128_update(n128, chunk, value);
+ break;
+ }
+
+ if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
+ {
+ rt_node_leaf_256 *new256;
+
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_128_get_value(n128, i));
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_128_insert(n128, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+	 * Done. Finally, verify that the chunk and value were inserted or
+	 * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].leaf_blocksize,
+ rt_node_kind_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set 'key' to 'value'. If the entry already exists, update its value to
+ * 'value' and return true; otherwise insert the new entry and return false.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is stored in *value_p, which
+ * therefore must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+	/* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend from the root to the leftmost leaf node. The key is
+	 * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+		/* Advance the leaf node iterator to get the next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner-node
+		 * iterators, starting from level 1, until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+		 * Found the next child node. Update the iterator stack from this
+		 * node down to the leaf level.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_128_get_child(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and store the
+ * value in *value_p; otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_128_get_value(n128, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *n128 = (rt_node_base_128 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(n128, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_128_is_slot_used((rt_node_leaf_128 *) node,
+ n128->slot_idxs[i]));
+ else
+ Assert(node_inner_128_is_slot_used((rt_node_inner_128 *) node,
+ n128->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n128->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+				/* Check if the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[0],
+ tree->cnt[1],
+ tree->cnt[2],
+ tree->cnt[3])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[128] = {0};
+
+ fprintf(stderr, "[%s] kind %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_128) ? 128 : 256,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_128:
+ {
+ rt_node_base_128 *b128 = (rt_node_base_128 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b128->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_128_is_chunk_used(b128, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_128 *n128 = (rt_node_leaf_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_128_get_value(n128, i));
+ }
+ else
+ {
+ rt_node_inner_128 *n128 = (rt_node_inner_128 *) b128;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_128_get_child(n128, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node; find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+ for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_node_kind_info[i].name,
+ rt_node_kind_info[i].inner_size,
+ rt_node_kind_info[i].inner_blocksize,
+ rt_node_kind_info[i].leaf_size,
+ rt_node_kind_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
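+/*
+ * Example usage (sketch):
+ *
+ *		radix_tree *rt = rt_create(CurrentMemoryContext);
+ *		uint64		value;
+ *
+ *		rt_set(rt, 42, 4242);
+ *		if (rt_search(rt, 42, &value))
+ *			Assert(value == 4242);
+ *		rt_delete(rt, 42);
+ *		rt_free(rt);
+ */
+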
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 96addded81..11d0ec5b07 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -27,6 +27,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1d26544854..568823b221 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -21,6 +21,7 @@ subdir('test_oat_hooks')
subdir('test_parser')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..5242538cec
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,32 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with node 4
+NOTICE: testing basic operations with node 32
+NOTICE: testing basic operations with node 128
+NOTICE: testing basic operations with node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..4198d7e976
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,582 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the tests, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing basic operations with node %d", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /* insert key-value pairs like 1, 32, 2, 31, 3, 30 ... */
+ for (int i = 0; i < children / 2; i++)
+ {
+ uint64 x;
+
+ x = i + 1;
+ if (rt_set(radixtree, x, x))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT "is found", x);
+ x = children - i;
+ if (rt_set(radixtree, x, x))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT "is found", x);
+ }
+
+ /* update these keys */
+ for (int i = 0; i < children / 2; i++)
+ {
+ uint64 x;
+
+ x = i + 1;
+ if (!rt_set(radixtree, x, x + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, x);
+ x = children - i;
+ if (!rt_set(radixtree, x, x + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, x);
+ }
+
+ /* delete these keys */
+ for (int i = 0; i < children / 2; i++)
+ {
+ uint64 x;
+
+ x = i + 1;
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ x = children - i;
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, x);
+ }
+
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+ static int rt_node_max_entries[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 128, /* RT_NODE_KIND_128 */
+ 256 /* RT_NODE_KIND_256 */
+ };
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_max_entries[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_max_entries[node_kind_idx - 1]
+ : rt_node_max_entries[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_max_entries[node_kind_idx]
+ : rt_node_max_entries[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ test_basic(4);
+ test_basic(32);
+ test_basic(128);
+ test_basic(256);
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
Attachment: v11-0005-Make-all-node-kinds-variable-sized.patch (application/x-patch)
From 5bab5b1c57233ceecaa46cb155e7b0f1e9e7d2b5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 24 Nov 2022 12:02:22 +0900
Subject: [PATCH v11 5/6] Make all node kinds variable sized
Add one additional size class for each of the node kinds 4, 32, and 128,
with fanouts 1, 15, and 61, respectively. The inner/leaf node sizes with
the new size classes are 24/24, 160/160, and 752/768 bytes, respectively.
For example, in size class 15, when a 16th element is to be inserted,
allocate a larger area and memcpy the entire old node to it.
This technique allows us to limit the node kinds to 4, which
1. limits the number of cases in switch statements
2. allows a possible future optimization to encode the node kind
in a pointer tag
---
src/backend/lib/radixtree.c | 470 +++++++++++++++++++++++++-----------
1 file changed, 329 insertions(+), 141 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index b71545e031..f10abd8add 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -133,8 +133,11 @@ typedef enum
typedef enum rt_size_class
{
- RT_CLASS_4_FULL = 0,
+ RT_CLASS_4_PARTIAL = 0,
+ RT_CLASS_4_FULL,
+ RT_CLASS_32_PARTIAL,
RT_CLASS_32_FULL,
+ RT_CLASS_128_PARTIAL,
RT_CLASS_128_FULL,
RT_CLASS_256
@@ -151,6 +154,8 @@ typedef struct rt_node
uint16 count;
/* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
uint8 fanout;
/*
@@ -168,8 +173,12 @@ typedef struct rt_node
#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
#define NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
+#define NODE_NEEDS_TO_GROW_CLASS(node, class) \
+ (((node)->base.n.count) == (rt_size_class_info[(class)].fanout))
/* Base type of each node kinds for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+class for variable-sized node kinds */
typedef struct rt_node_base_4
{
rt_node n;
@@ -221,40 +230,40 @@ typedef struct rt_node_inner_4
{
rt_node_base_4 base;
- /* 4 children, for key chunks */
- rt_node *children[4];
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
{
rt_node_base_4 base;
- /* 4 values, for key chunks */
- uint64 values[4];
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_4;
typedef struct rt_node_inner_32
{
rt_node_base_32 base;
- /* 32 children, for key chunks */
- rt_node *children[32];
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
{
rt_node_base_32 base;
- /* 32 values, for key chunks */
- uint64 values[32];
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_32;
typedef struct rt_node_inner_128
{
rt_node_base_128 base;
- /* Slots for 128 children */
- rt_node *children[128];
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_128;
typedef struct rt_node_leaf_128
@@ -264,8 +273,8 @@ typedef struct rt_node_leaf_128
/* isset is a bitmap to track which slot is in use */
uint8 isset[RT_NODE_NSLOTS_BITS(128)];
- /* Slots for 128 values */
- uint64 values[128];
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_128;
/*
@@ -311,32 +320,55 @@ typedef struct rt_size_class_elem
* from the block.
*/
#define NODE_SLAB_BLOCK_SIZE(size) \
- Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * size, (size) * 32)
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
-
+ [RT_CLASS_4_PARTIAL] = {
+ .name = "radix tree node 1",
+ .fanout = 1,
+ .inner_size = sizeof(rt_node_inner_4) + 1 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 1 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 1 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 1 * sizeof(uint64)),
+ },
[RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
- .inner_size = sizeof(rt_node_inner_4),
- .leaf_size = sizeof(rt_node_leaf_4),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4)),
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
- .inner_size = sizeof(rt_node_inner_32),
- .leaf_size = sizeof(rt_node_leaf_32),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32)),
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_128_PARTIAL] = {
+ .name = "radix tree node 61",
+ .fanout = 61,
+ .inner_size = sizeof(rt_node_inner_128) + 61 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_128) + 61 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128) + 61 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128) + 61 * sizeof(uint64)),
},
[RT_CLASS_128_FULL] = {
.name = "radix tree node 128",
.fanout = 128,
- .inner_size = sizeof(rt_node_inner_128),
- .leaf_size = sizeof(rt_node_leaf_128),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128)),
+ .inner_size = sizeof(rt_node_inner_128) + 128 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_128) + 128 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_128) + 128 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_128) + 128 * sizeof(uint64)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
@@ -352,9 +384,9 @@ static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
/* Map from the node kind to its minimum size class */
static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
- [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
- [RT_NODE_KIND_32] = RT_CLASS_32_FULL,
- [RT_NODE_KIND_128] = RT_CLASS_128_FULL,
+ [RT_NODE_KIND_4] = RT_CLASS_4_PARTIAL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_128] = RT_CLASS_128_PARTIAL,
[RT_NODE_KIND_256] = RT_CLASS_256,
};
@@ -867,7 +899,7 @@ rt_new_root(radix_tree *tree, uint64 key)
int shift = key_get_shift(key);
rt_node *node;
- node = (rt_node *) rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL,
+ node = (rt_node *) rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_PARTIAL,
shift, 0, shift > 0);
tree->max_val = shift_get_max_val(shift);
tree->root = node;
@@ -965,7 +997,6 @@ rt_free_node(radix_tree *tree, rt_node *node)
#ifdef RT_DEBUG
/* update the statistics */
- // FIXME
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
if (node->fanout == rt_size_class_info[i].fanout)
@@ -1021,7 +1052,7 @@ rt_extend(radix_tree *tree, uint64 key)
{
rt_node_inner_4 *node;
- node = (rt_node_inner_4 *) rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL,
+ node = (rt_node_inner_4 *) rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_PARTIAL,
shift, 0, true);
node->base.n.count = 1;
node->base.chunks[0] = 0;
@@ -1051,7 +1082,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
rt_node *newchild;
int newshift = shift - RT_NODE_SPAN;
- newchild = rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_FULL, newshift,
+ newchild = rt_alloc_init_node(tree, RT_NODE_KIND_4, RT_CLASS_4_PARTIAL, newshift,
RT_GET_KEY_CHUNK(key, node->shift),
newshift > 0);
rt_node_insert_inner(tree, parent, node, key, newchild);
@@ -1279,33 +1310,63 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
{
- rt_node_inner_32 *new32;
+ Assert(parent != NULL);
- /* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
- chunk_children_array_copy(n4->base.chunks, n4->children,
- new32->base.chunks, new32->children,
- n4->base.n.count);
+ if (NODE_NEEDS_TO_GROW_CLASS(n4, RT_CLASS_4_PARTIAL))
+ {
+ rt_node_inner_4 *new4;
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ /*
+ * Use the same node kind, but expand to the next size class. We
+ * copy the entire old node -- the new node is only different in
+ * having additional slots so we only have to change the fanout.
+ */
+ new4 = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ memcpy(new4, n4, rt_size_class_info[RT_CLASS_4_PARTIAL].inner_size);
+ new4->base.n.fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new4,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new4;
+ n4 = new4;
+
+ goto retry_insert_inner_4;
+ }
+ else
+ {
+ rt_node_inner_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children,
+ n4->base.n.count);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
}
else
{
- int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
- uint16 count = n4->base.n.count;
+ retry_insert_inner_4:
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
- /* shift chunks and children */
- if (count != 0 && insertpos < count)
- chunk_children_array_shift(n4->base.chunks, n4->children,
- count, insertpos);
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
- n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
- break;
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
}
}
/* FALLTHROUGH */
@@ -1325,31 +1386,56 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
{
- rt_node_inner_128 *new128;
+ Assert(parent != NULL);
+
+ if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))
+ {
+ /* use the same node kind, but expand to the next size class */
+ rt_node_inner_32 *new32;
- /* grow node from 32 to 128 */
- new128 = (rt_node_inner_128 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
- for (int i = 0; i < n32->base.n.count; i++)
- node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
+ new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
+ new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
- node = (rt_node *) new128;
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_inner_32;
+ }
+ else
+ {
+ rt_node_inner_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_inner_128 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_128_insert(new128, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
}
else
{
- int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
- int16 count = n32->base.n.count;
+retry_insert_inner_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
- if (count != 0 && insertpos < count)
- chunk_children_array_shift(n32->base.chunks, n32->children,
- count, insertpos);
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
- n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
- break;
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
}
}
/* FALLTHROUGH */
@@ -1368,29 +1454,54 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
{
- rt_node_inner_256 *new256;
+ Assert(parent != NULL);
- /* grow node from 128 to 256 */
- new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
- for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ if (NODE_NEEDS_TO_GROW_CLASS(n128, RT_CLASS_128_PARTIAL))
{
- if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
- continue;
+ /* use the same node kind, but expand to the next size class */
+ rt_node_inner_128 *new128;
- node_inner_256_set(new256, i, node_inner_128_get_child(n128, i));
- cnt++;
+ new128 = (rt_node_inner_128 *) rt_alloc_node(tree, RT_CLASS_128_FULL, true);
+ memcpy(new128, n128, rt_size_class_info[RT_CLASS_128_PARTIAL].inner_size);
+ new128->base.n.fanout = rt_size_class_info[RT_CLASS_128_FULL].fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new128,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new128;
+ n128 = new128;
+
+ goto retry_insert_inner_128;
}
+ else
+ {
+ rt_node_inner_256 *new256;
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_128_get_child(n128, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
}
else
{
- node_inner_128_insert(n128, chunk, child);
- break;
+ retry_insert_inner_128:
+ {
+ node_inner_128_insert(n128, chunk, child);
+ break;
+ }
}
}
/* FALLTHROUGH */
@@ -1448,33 +1559,57 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!NODE_HAS_FREE_SLOT(n4)))
{
- rt_node_leaf_32 *new32;
+ Assert(parent != NULL);
- /* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
- chunk_values_array_copy(n4->base.chunks, n4->values,
- new32->base.chunks, new32->values,
- n4->base.n.count);
+ if (NODE_NEEDS_TO_GROW_CLASS(n4, RT_CLASS_4_PARTIAL))
+ {
+ /* use the same node kind, but expand to the next size class */
+ rt_node_leaf_4 *new4;
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ new4 = (rt_node_leaf_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, false);
+ memcpy(new4, n4, rt_size_class_info[RT_CLASS_4_PARTIAL].leaf_size);
+ new4->base.n.fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new4,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new4;
+ n4 = new4;
+
+ goto retry_insert_leaf_4;
+ }
+ else
+ {
+ rt_node_leaf_32 *new32;
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values,
+ n4->base.n.count);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
}
else
{
- int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
- int count = n4->base.n.count;
+ retry_insert_leaf_4:
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
- /* shift chunks and values */
- if (count != 0 && insertpos < count)
- chunk_values_array_shift(n4->base.chunks, n4->values,
- count, insertpos);
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
- n4->base.chunks[insertpos] = chunk;
- n4->values[insertpos] = value;
- break;
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
}
}
/* FALLTHROUGH */
@@ -1494,31 +1629,56 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!NODE_HAS_FREE_SLOT(n32)))
{
- rt_node_leaf_128 *new128;
+ Assert(parent != NULL);
+
+ if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))
+ {
+ /* use the same node kind, but expand to the next size class */
+ rt_node_leaf_32 *new32;
- /* grow node from 32 to 128 */
- new128 = (rt_node_leaf_128 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_128);
- for (int i = 0; i < n32->base.n.count; i++)
- node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
+ new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size);
+ new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
- key);
- node = (rt_node *) new128;
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
+ rt_node_leaf_128 *new128;
+
+ /* grow node from 32 to 128 */
+ new128 = (rt_node_leaf_128 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_128);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_128_insert(new128, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new128,
+ key);
+ node = (rt_node *) new128;
+ }
}
else
{
- int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
- int count = n32->base.n.count;
+ retry_insert_leaf_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
- if (count != 0 && insertpos < count)
- chunk_values_array_shift(n32->base.chunks, n32->values,
- count, insertpos);
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
- n32->base.chunks[insertpos] = chunk;
- n32->values[insertpos] = value;
- break;
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
}
}
/* FALLTHROUGH */
@@ -1537,29 +1697,54 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!NODE_HAS_FREE_SLOT(n128)))
{
- rt_node_leaf_256 *new256;
+ Assert(parent != NULL);
- /* grow node from 128 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n128,
- RT_NODE_KIND_256);
- for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ if (NODE_NEEDS_TO_GROW_CLASS(n128, RT_CLASS_128_PARTIAL))
{
- if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
- continue;
+ /* use the same node kind, but expand to the next size class */
+ rt_node_leaf_128 *new128;
+
+ new128 = (rt_node_leaf_128 *) rt_alloc_node(tree, RT_CLASS_128_FULL, false);
+ memcpy(new128, n128, rt_size_class_info[RT_CLASS_128_PARTIAL].leaf_size);
+ new128->base.n.fanout = rt_size_class_info[RT_CLASS_128_FULL].fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new128,
+ key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new128;
+ n128 = new128;
- node_leaf_256_set(new256, i, node_leaf_128_get_value(n128, i));
- cnt++;
+ goto retry_insert_leaf_128;
}
+ else
+ {
+ rt_node_leaf_256 *new256;
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ /* grow node from 128 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n128,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n128->base.n.count; i++)
+ {
+ if (!node_128_is_chunk_used((rt_node_base_128 *) n128, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_128_get_value(n128, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n128, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
}
else
{
- node_leaf_128_insert(n128, chunk, value);
- break;
+ retry_insert_leaf_128:
+ {
+ node_leaf_128_insert(n128, chunk, value);
+ break;
+ }
}
}
/* FALLTHROUGH */
@@ -2222,11 +2407,14 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
- ereport(LOG, (errmsg("num_keys = %lu, height = %u, n4 = %u, n32 = %u, n128 = %u, n256 = %u",
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n1 = %u, n4 = %u, n15 = %u, n32 = %u, n61 = %u, n128 = %u, n256 = %u",
tree->num_keys,
tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_PARTIAL],
tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_128_PARTIAL],
tree->cnt[RT_CLASS_128_FULL],
tree->cnt[RT_CLASS_256])));
}
--
2.31.1
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
So it seems that there are two candidates for the rt_node structure: (1)
all nodes except for node256 are variable-size nodes and use pointer
tagging, and (2) only node32 and node128 are variable-sized nodes and do
not use pointer tagging (the fanout field is part of only these two node
kinds). rt_node can be 5 bytes in both cases. But before going to this
step, I started to verify the idea of variable-size nodes by using a
6-byte rt_node. We can adjust the node kinds and node classes later.
First, I'm glad you picked up the size class concept and expanded it. (I
have some comments about some internal APIs below.)
Let's leave the pointer tagging piece out until the main functionality is
committed. We have all the prerequisites in place, except for a benchmark
random enough to demonstrate benefit. I'm still not quite satisfied with
how the shared memory coding looked, and that is the only sticky problem we
still have, IMO. The rest is "just work".
That said, (1) and (2) above are still relevant -- variable sizing any
given node is optional, and we can refine as needed.
Overall, the idea of variable-sized nodes is good: smaller size
without losing search performance.
Good.
I'm going to check the load
performance as well.
Part of that is this, which gets called a lot more now, when node1 expands:
+ if (inner)
+ newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+ rt_node_kind_info[kind].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+ rt_node_kind_info[kind].leaf_size);
Since memset for expanding size class is now handled separately, these can
use the non-zeroing versions. When compiling MemoryContextAllocZero, the
compiler has no idea how big the size is, so it assumes the worst and
optimizes for large sizes. On x86-64, that means using "rep stos",
which calls microcode found in the CPU's ROM. This is slow for small sizes.
The "init" function should be always inline with const parameters where
possible. That way, memset can compile to a single instruction for the
smallest node kind. (More on alloc/init below)
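
To sketch what I have in mind (names, struct fields, and the per-class slab
indexing here are assumptions based on v11, not a finished API): the
allocation step uses the non-zeroing allocator, and an always-inline init
helper takes the node size as a constant so the compiler can emit a short,
fixed-size memset:

/* Sketch only: allocate without zeroing; assumes per-size-class slab contexts */
static rt_node *
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
    if (inner)
        return (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
                                              rt_size_class_info[size_class].inner_size);
    else
        return (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
                                              rt_size_class_info[size_class].leaf_size);
}

/* Sketch only: node_size must be a compile-time constant at each call site */
static pg_attribute_always_inline void
rt_init_node(rt_node *node, uint8 kind, uint8 fanout, Size node_size)
{
    memset(node, 0, node_size);
    node->kind = kind;      /* assumed field names */
    node->fanout = fanout;
}

With that, the grow-within-kind path can keep doing what it does now --
memcpy the old node and bump the fanout -- without paying for a redundant
zeroing first, and init for the smallest node kind becomes a handful of
stores.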
Note, there is a wrinkle: as currently written, inner_node128 searches the
child pointers for NULL when inserting, so when expanding from partial to
full size class, the new node must be zeroed. (Worth fixing in the short
term; I thought of this while writing the proof-of-concept for size
classes, but didn't mention it.) Medium term, rather than special-casing
this, I actually want to rewrite the inner-node128 to be more similar to
the leaf, with an "isset" array, but accessed and tested differently. I
guarantee it's *really* slow now to load (maybe somewhat true even for
leaves), but I'll leave the details for later. Regarding node128 leaf, note
that it's slightly larger than a DSA size class, and we can trim it to fit:
node61: 6 + 256 + (2) + 16 + 61*8 = 768
node125: 6 + 256 + (2) + 16 + 125*8 = 1280
I've attached the patches I used for the verification. I don't include
patches for pointer tagging, DSA support, and vacuum integration since
I'm investigating the issue on cfbot that Andres reported. Also, I've
modified tests to improve the test coverage.
Sounds good. For v12, I think size classes have proven themselves, so v11's
0002/4/5 can be squashed. Plus, some additional comments:
+/* Return a new and initialized node */
+static rt_node *
+rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk,
bool inner)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, kind, inner);
+ rt_init_node(newnode, kind, shift, chunk, inner);
+
+ return newnode;
+}
I don't see the point of a function that just calls two functions.
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node *
+rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+ rt_node *newnode;
+
+ newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
+ node->shift > 0);
+ newnode->count = node->count;
+
+ return newnode;
+}
This, in turn, just calls a function that does _almost_ everything, and
additionally must set one member. This function should really be alloc-node
+ init-node + copy-common, where copy-common is like in the prototype:
+ newnode->node_shift = oldnode->node_shift;
+ newnode->node_chunk = oldnode->node_chunk;
+ newnode->count = oldnode->count;
And init-node should really be just memset + set kind + set initial fanout.
It has no business touching "shift" and "chunk". The callers rt_new_root,
rt_set_extend, and rt_extend set some values of their own anyway, so let
them set those, too -- it might even improve readability.
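
Concretely, something like this (a sketch; the field names follow v11, so
treat them as assumptions):

/* Copy the header fields shared by all node kinds */
static inline void
rt_copy_node_common(rt_node *newnode, rt_node *oldnode)
{
    newnode->shift = oldnode->shift;
    newnode->chunk = oldnode->chunk;
    newnode->count = oldnode->count;
}

Then rt_grow_node() reduces to alloc-node + init-node + rt_copy_node_common(),
and rt_new_root, rt_set_extend, and rt_extend call alloc-node + init-node and
fill in shift and chunk themselves.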
- if (n32->base.n.fanout ==
rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))
This macro doesn't really improve readability -- it obscures what is being
tested, and the name implies the "else" branch means "node doesn't need to
grow class", which is false. If we want to simplify expressions in this
block, I think it'd be more effective to improve the lines that follow:
+ memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
+ new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
Maybe we can have const variables old_size and new_fanout to break out the
array lookup? While I'm thinking of it, these arrays should be const so the
compiler can avoid runtime lookups. Speaking of...
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}
When I looked at this earlier, I somehow didn't go far enough -- why are we
passing the runtime count in the first place? This function can only be
called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last
parameter to memcpy should evaluate to a compile-time constant, right? Even
when we add node shrinking in the future, the constant should be correct,
IIUC?
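
In other words (a sketch, assuming the only caller is the node4-to-node32
growth path and that rt_size_class_info is declared const):

/* Copy chunks and children out of a full node4; count is known at compile time */
static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
                          uint8 *dst_chunks, rt_node **dst_children)
{
    const int count = rt_size_class_info[RT_CLASS_4_FULL].fanout;

    /* with a const lookup table, "count" should fold to the constant 4 */
    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}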
- .fanout = 256,
+ /* technically it's 256, but we can't store that in a uint8,
+ and this is the max size class so it will never grow */
+ .fanout = 0,
- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(((rt_node *) n256)->fanout == 0);
+ Assert(chunk_exists || ((rt_node *) n256)->count < 256);
These hacks were my work, but I think we can improve that by having two
versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized
nodes. For that to work, in "init-node" we'd need a branch to set fanout to
zero for node256. That should be fine -- it already has to branch for
memset'ing node128's indexes to 0xFF.
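
Something along these lines, say (macro names are placeholders):

/* For fixed-size kinds (node256): fanout is stored as 0, so test the true limit */
#define FIXED_NODE_HAS_FREE_SLOT(node) \
    ((node)->base.n.count < RT_NODE_MAX_SLOTS)

/* For variable-size kinds: fanout reflects the current size class */
#define VAR_NODE_HAS_FREE_SLOT(node) \
    ((node)->base.n.count < (node)->base.n.fanout)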
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
[v11]
There is one more thing that just now occurred to me: expanding the use
of size classes makes rebasing and reworking the shared memory piece
more work than it should be. That's important because there are still some
open questions about the design around shared memory. To keep unnecessary
churn to a minimum, perhaps we should limit size class expansion to just
one (or 5 total size classes) for the near future?
--
John Naylor
EDB: http://www.enterprisedb.com
While creating a benchmark for inserting into node128-inner, I found a bug.
If a caller deletes from a node128, the slot index is set to invalid, but
the child pointer is still valid. Do that a few times, and every child
pointer is valid, even if no slot index points to it. When the next
inserter comes along, something surprising happens. This function:
/* Return an unused slot in node-128 */
static int
node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
{
int slotpos = 0;
Assert(!NODE_IS_LEAF(node));
while (node_inner_128_is_slot_used(node, slotpos))
slotpos++;
return slotpos;
}
...passes an integer to this function, whose parameter is a uint8:
/* Is the slot in the node used? */
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
return (node->children[slot] != NULL);
}
...so instead of growing the node unnecessarily or segfaulting, it enters
an infinite loop doing this:
add eax, 1
movzx ecx, al
cmp QWORD PTR [rbx+264+rcx*8], 0
jne .L147
The fix is easy enough -- set the child pointer to null upon deletion, but
I'm somewhat astonished that the regression tests didn't hit this. I do
still intend to replace this code with something faster, but before I do so
the tests should probably exercise the deletion paths more. Since VACUUM
--
John Naylor
EDB: http://www.enterprisedb.com
The fix is easy enough -- set the child pointer to null upon deletion,
but I'm somewhat astonished that the regression tests didn't hit this. I do
still intend to replace this code with something faster, but before I do so
the tests should probably exercise the deletion paths more. Since VACUUM
Oops. I meant to finish with "Since VACUUM doesn't perform deletion we
didn't have an opportunity to detect this during that operation."
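
For reference, the minimal fix would look something like this (a sketch
against the v11 shape of node_inner_128_delete, with v11 field names):

static void
node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
{
    int     slotpos = node->base.slot_idxs[chunk];

    Assert(!NODE_IS_LEAF(node));

    /* clear the child pointer too, so the slot no longer looks "used" */
    node->children[slotpos] = NULL;
    node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
}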
--
John Naylor
EDB: http://www.enterprisedb.com
There are a few things up in the air, so I'm coming back to this list to
summarize and add a recent update:
On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
- See how much performance we actually gain from tagging the node kind.
Needs a benchmark that has enough branch mispredicts and L2/3 misses to
show a benefit. Otherwise either neutral or worse in its current form,
depending on compiler(?). Put off for later.
- Try additional size classes while keeping the node kinds to only four.
This is relatively simple and effective. If only one additional size class
(total 5) is coded as a placeholder, I imagine it will be easier to rebase
shared memory logic than using this technique everywhere possible.
- Optimize node128 insert.
I've attached a rough start at this. The basic idea is borrowed from our
bitmapset nodes, so we can iterate over and operate on word-sized (32- or
64-bit) types at a time, rather than bytes. To make this easier, I've moved
some of the lower-level macros and types from bitmapset.h/.c to
pg_bitutils.h. That's probably going to need a separate email thread to
resolve the coding style clash this causes, so that can be put off for
later. This is not meant to be included in the next patchset. For
demonstration purposes, I get these results with a function that repeatedly
deletes the last value from a mostly-full node128 leaf and re-inserts it:
select * from bench_node128_load(120);
v11
NOTICE: num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0,
n61 = 0, n128 = 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 208304 | 56
v11 + 0006 addendum
NOTICE: num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0,
n61 = 0, n128 = 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 208816 | 34
I didn't test inner nodes, but I imagine the difference is bigger. This
bitmap style should also be used for the node256-leaf isset array simply to
be consistent and avoid needing single-use macros, but that has not been
done yet. It won't make a difference for performance because there is no
iteration there.
- Try templating out the differences between local and shared memory.
I hope to start this sometime after the crashes on 32-bit are resolved.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v11-0006-addendum-bitmapword-node128.patch.txt (text/plain)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 67ba568531..2fd689aa91 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -63,3 +63,14 @@ OUT rt_search_ms int8
returns record
as 'MODULE_PATHNAME'
LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index e69be48448..b035b3a747 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -31,6 +31,7 @@ PG_FUNCTION_INFO_V1(bench_shuffle_search);
PG_FUNCTION_INFO_V1(bench_load_random_int);
PG_FUNCTION_INFO_V1(bench_fixed_height_search);
PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
static uint64
tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -552,3 +553,85 @@ finish_search:
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index f10abd8add..9cfed1624f 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -262,6 +262,9 @@ typedef struct rt_node_inner_128
{
rt_node_base_128 base;
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[WORDNUM(128)];
+
/* number of children depends on size class */
rt_node *children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_128;
@@ -271,7 +274,7 @@ typedef struct rt_node_leaf_128
rt_node_base_128 base;
/* isset is a bitmap to track which slot is in use */
- uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+ bitmapword isset[WORDNUM(128)];
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
@@ -679,14 +682,14 @@ static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
- return (node->children[slot] != NULL);
+ return (node->isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
static inline bool
node_leaf_128_is_slot_used(rt_node_leaf_128 *node, uint8 slot)
{
Assert(NODE_IS_LEAF(node));
- return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+ return (node->isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
static inline rt_node *
@@ -707,7 +710,10 @@ node_leaf_128_get_value(rt_node_leaf_128 *node, uint8 chunk)
static void
node_inner_128_delete(rt_node_inner_128 *node, uint8 chunk)
{
+ int slotpos = node->base.slot_idxs[chunk];
+
Assert(!NODE_IS_LEAF(node));
+ node->isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
}
@@ -717,41 +723,32 @@ node_leaf_128_delete(rt_node_leaf_128 *node, uint8 chunk)
int slotpos = node->base.slot_idxs[chunk];
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_128_INVALID_IDX;
}
/* Return an unused slot in node-128 */
static int
-node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
-{
- int slotpos = 0;
-
- Assert(!NODE_IS_LEAF(node));
- while (node_inner_128_is_slot_used(node, slotpos))
- slotpos++;
-
- return slotpos;
-}
-
-static int
-node_leaf_128_find_unused_slot(rt_node_leaf_128 *node, uint8 chunk)
+node128_find_unused_slot(bitmapword *isset)
{
int slotpos;
+ int idx;
+ bitmapword inverse;
- Assert(NODE_IS_LEAF(node));
-
- /* We iterate over the isset bitmap per byte then check each bit */
- for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < WORDNUM(128); idx++)
{
- if (node->isset[slotpos] < 0xFF)
+ if (isset[idx] < ~((bitmapword) 0))
break;
}
- Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
- slotpos *= BITS_PER_BYTE;
- while (node_leaf_128_is_slot_used(node, slotpos))
- slotpos++;
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+
+ /* mark the slot used */
+ isset[idx] |= RIGHTMOST_ONE(inverse);
return slotpos;
}
@@ -763,8 +760,7 @@ node_inner_128_insert(rt_node_inner_128 *node, uint8 chunk, rt_node *child)
Assert(!NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_inner_128_find_unused_slot(node, chunk);
+ slotpos = node128_find_unused_slot(node->isset);
node->base.slot_idxs[chunk] = slotpos;
node->children[slotpos] = child;
@@ -778,11 +774,9 @@ node_leaf_128_insert(rt_node_leaf_128 *node, uint8 chunk, uint64 value)
Assert(NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_leaf_128_find_unused_slot(node, chunk);
+ slotpos = node128_find_unused_slot(node->isset);
node->base.slot_idxs[chunk] = slotpos;
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
node->values[slotpos] = value;
}
@@ -2508,9 +2502,9 @@ rt_dump_node(rt_node *node, int level, bool recurse)
rt_node_leaf_128 *n = (rt_node_leaf_128 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < 16; i++)
+ for (int i = 0; i < WORDNUM(128); i++)
{
- fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ fprintf(stderr, "%lX ", n->isset[i]);
}
fprintf(stderr, "\n");
}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..3fe0fd88ce 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -23,49 +23,11 @@
#include "common/hashfn.h"
#include "nodes/bitmapset.h"
#include "nodes/pg_list.h"
-#include "port/pg_bitutils.h"
-#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
-
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
-
/*
* bms_copy - make a palloc'd copy of a bitmapset
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..06fa21ccaa 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -21,33 +21,13 @@
#define BITMAPSET_H
#include "nodes/nodes.h"
+#include "port/pg_bitutils.h"
/*
* Forward decl to save including pg_list.h
*/
struct List;
-/*
- * Data representation
- *
- * Larger bitmap word sizes generally give better performance, so long as
- * they're not wider than the processor can handle efficiently. We use
- * 64-bit words if pointers are that large, else 32-bit words.
- */
-#if SIZEOF_VOID_P >= 8
-
-#define BITS_PER_BITMAPWORD 64
-typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
-
-#else
-
-#define BITS_PER_BITMAPWORD 32
-typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
-
-#endif
-
typedef struct Bitmapset
{
pg_node_attr(custom_copy_equal, special_read_write)
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 814e0b2dba..ad5aa2c5cf 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,51 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*
+ * Platform-specific types
+ *
+ * Larger bitmap word sizes generally give better performance, so long as
+ * they're not wider than the processor can handle efficiently. We use
+ * 64-bit words if pointers are that large, else 32-bit words.
+ */
+#if SIZEOF_VOID_P >= 8
+
+#define BITS_PER_BITMAPWORD 64
+typedef uint64 bitmapword; /* must be an unsigned type */
+typedef int64 signedbitmapword; /* must be the matching signed type */
+
+#else
+
+#define BITS_PER_BITMAPWORD 32
+typedef uint32 bitmapword; /* must be an unsigned type */
+typedef int32 signedbitmapword; /* must be the matching signed type */
+
+#endif
+
+#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
+
+#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
@@ -291,4 +336,17 @@ pg_rotate_left32(uint32 word, int n)
#define pg_prevpower2_size_t pg_prevpower2_64
#endif
+/* variants of some functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_leftmost_one_pos pg_leftmost_one_pos32
+#define bmw_rightmost_one_pos pg_rightmost_one_pos32
+#define bmw_popcount pg_popcount32
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_leftmost_one_pos pg_leftmost_one_pos64
+#define bmw_rightmost_one_pos pg_rightmost_one_pos64
+#define bmw_popcount pg_popcount64
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
+
#endif /* PG_BITUTILS_H */
On Fri, Nov 25, 2022 at 5:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
So it seems that there are two candidates of rt_node structure: (1)
all nodes except for node256 are variable-size nodes and use pointer
tagging, and (2) node32 and node128 are variable-sized nodes and do
not use pointer tagging (fanout is in part of only these two nodes).
rt_node can be 5 bytes in both cases. But before going to this step, I
started to verify the idea of variable-size nodes by using 6-bytes
rt_node. We can adjust the node kinds and node classes later.First, I'm glad you picked up the size class concept and expanded it. (I have some comments about some internal APIs below.)
Let's leave the pointer tagging piece out until the main functionality is committed. We have all the prerequisites in place, except for a benchmark random enough to demonstrate benefit. I'm still not quite satisfied with how the shared memory coding looked, and that is the only sticky problem we still have, IMO. The rest is "just work".
That said, (1) and (2) above are still relevant -- variable sizing any given node is optional, and we can refine as needed.
Overall, the idea of variable-sized nodes is good, smaller size
without losing search performance.

Good.
I'm going to check the load
performance as well.

Part of that is this, which gets called a lot more now, when node1 expands:

+	if (inner)
+		newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
+													 rt_node_kind_info[kind].inner_size);
+	else
+		newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
+													 rt_node_kind_info[kind].leaf_size);

Since memset for expanding size class is now handled separately, these can use the non-zeroing versions. When compiling MemoryContextAllocZero, the compiler has no idea how big the size is, so it assumes the worst and optimizes for large sizes. On x86-64, that means using "rep stos", which calls microcode found in the CPU's ROM. This is slow for small sizes. The "init" function should be always inline with const parameters where possible. That way, memset can compile to a single instruction for the smallest node kind. (More on alloc/init below)
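A rough sketch of that direction (the rt_* names follow the thread's conventions, but the exact signatures here are assumptions, not the actual v12 code):

/*
 * Sketch only: allocate without zeroing, then zero inside an always-inline
 * init whose size argument is a compile-time constant at each call site,
 * so the compiler can emit a few stores instead of calling the generic
 * large-size memset path.
 */
static pg_attribute_always_inline void
rt_init_node(rt_node *node, uint8 kind, Size node_size)
{
	memset(node, 0, node_size);	/* node_size = sizeof(<concrete node struct>) */
	node->kind = kind;
}

static rt_node *
rt_alloc_node(radix_tree *tree, int kind, bool inner)
{
	if (inner)
		return (rt_node *) MemoryContextAlloc(tree->inner_slabs[kind],
											  rt_node_kind_info[kind].inner_size);
	else
		return (rt_node *) MemoryContextAlloc(tree->leaf_slabs[kind],
											  rt_node_kind_info[kind].leaf_size);
}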
Right. I forgot to update it.
Note, there is a wrinkle: As currently written inner_node128 searches the child pointers for NULL when inserting, so when expanding from partial to full size class, the new node must be zeroed (Worth fixing in the short term. I thought of this while writing the proof-of-concept for size classes, but didn't mention it.) Medium term, rather than special-casing this, I actually want to rewrite the inner-node128 to be more similar to the leaf, with an "isset" array, but accessed and tested differently. I guarantee it's *really* slow now to load (maybe somewhat true even for leaves), but I'll leave the details for later.
Agreed, I start with zeroing out the node when expanding from partial
to full size.
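For instance, the grow path could do something like the following (a sketch only; the RT_CLASS_128_* names and the inner_size/fanout fields are assumed, in the spirit of the thread's size-class array):

	/* grow an inner node128 from the partial to the full size class */
	memcpy(new128, n128, rt_size_class_info[RT_CLASS_128_PARTIAL].inner_size);

	/*
	 * Zero the newly exposed tail: inner-node128 insertion scans the
	 * children array for a NULL entry, so the extra slots must not
	 * contain stale bytes.
	 */
	memset((char *) new128 + rt_size_class_info[RT_CLASS_128_PARTIAL].inner_size,
		   0,
		   rt_size_class_info[RT_CLASS_128_FULL].inner_size -
		   rt_size_class_info[RT_CLASS_128_PARTIAL].inner_size);

	new128->base.n.fanout = rt_size_class_info[RT_CLASS_128_FULL].fanout;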
Regarding node128 leaf, note that it's slightly larger than a DSA size class, and we can trim it to fit:
node61: 6 + 256+(2) +16 + 61*8 = 768
node125: 6 + 256+(2) +16 + 125*8 = 1280
Agreed, changed.
I've attached the patches I used for the verification. I don't include
patches for pointer tagging, DSA support, and vacuum integration since
I'm investigating the issue on cfbot that Andres reported. Also, I've
modified tests to improve the test coverage.

Sounds good. For v12, I think size classes have proven themselves, so v11's 0002/4/5 can be squashed. Plus, some additional comments:
+/* Return a new and initialized node */
+static rt_node *
+rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk, bool inner)
+{
+	rt_node    *newnode;
+
+	newnode = rt_alloc_node(tree, kind, inner);
+	rt_init_node(newnode, kind, shift, chunk, inner);
+
+	return newnode;
+}

I don't see the point of a function that just calls two functions.
Removed.
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node *
+rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
+{
+	rt_node    *newnode;
+
+	newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
+								 node->shift > 0);
+	newnode->count = node->count;
+
+	return newnode;
+}

This, in turn, just calls a function that does _almost_ everything, and additionally must set one member. This function should really be alloc-node + init-node + copy-common, where copy-common is like in the prototype:

+	newnode->node_shift = oldnode->node_shift;
+	newnode->node_chunk = oldnode->node_chunk;
+	newnode->count = oldnode->count;

And init-node should really be just memset + set kind + set initial fanout. It has no business touching "shift" and "chunk". The callers rt_new_root, rt_set_extend, and rt_extend set some values of their own anyway, so let them set those, too -- it might even improve readability.
- if (n32->base.n.fanout == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))
Agreed.
This macro doesn't really improve readability -- it obscures what is being tested, and the name implies the "else" branch means "node doesn't need to grow class", which is false. If we want to simplify expressions in this block, I think it'd be more effective to improve the lines that follow:
+	memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
+	new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;

Maybe we can have const variables old_size and new_fanout to break out the array lookup? While I'm thinking of it, these arrays should be const so the compiler can avoid runtime lookups. Speaking of...
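For example (a sketch, assuming rt_size_class_info has been made const so these lookups fold to constants):

	const Size	old_size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
	const uint8	new_fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;

	memcpy(new32, n32, old_size);
	new32->base.n.fanout = new_fanout;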
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+						  uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+	/* For better code generation */
+	if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+		pg_unreachable();
+
+	memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+	memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}

When I looked at this earlier, I somehow didn't go far enough -- why are we passing the runtime count in the first place? This function can only be called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last parameter to memcpy should evaluate to a compile-time constant, right? Even when we add node shrinking in the future, the constant should be correct, IIUC?
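Something like the following, perhaps (a sketch; it relies on rt_size_class_info being const so the fanout lookup becomes a compile-time constant):

/* Copy both chunks and children arrays; count is implicitly the full node4 fanout */
static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
						  uint8 *dst_chunks, rt_node **dst_children)
{
	const int	count = rt_size_class_info[RT_CLASS_4_FULL].fanout;

	memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
	memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}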
Right. We don't need to pass count to these functions.
-	.fanout = 256,
+	/* technically it's 256, but we can't store that in a uint8,
+	   and this is the max size class so it will never grow */
+	.fanout = 0,

-	Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+	Assert(((rt_node *) n256)->fanout == 0);
+	Assert(chunk_exists || ((rt_node *) n256)->count < 256);

These hacks were my work, but I think we can improve that by having two versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized nodes. For that to work, in "init-node" we'd need a branch to set fanout to zero for node256. That should be fine -- it already has to branch for memset'ing node128's indexes to 0xFF.
Since the node has fanout regardless of fixed-sized and
variable-sized, only node256 is the special case where the fanout in
the node doesn't match the actual fanout of the node. I think if we
want to have two versions of NODE_HAS_FREE_SLOT, we can have one for
node256 and one for other classes. Thoughts? In your idea, for
NODE_HAS_FREE_SLOT for fixed-sized nodes, you meant like the
following?
#define FIXED_NODE_HAS_FREE_SLOT(node, class)
(node->base.n.count < rt_size_class_info[class].fanout)
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Nov 25, 2022 at 6:47 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
[v11]
There is one more thing that just now occurred to me: In expanding the use of size classes, that makes rebasing and reworking the shared memory piece more work than it should be. That's important because there are still some open questions about the design around shared memory. To keep unnecessary churn to a minimum, perhaps we should limit size class expansion to just one (or 5 total size classes) for the near future?
Makes sense. We can add size classes once we have a good design and
implementation around shared memory.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Nov 29, 2022 at 1:36 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
While creating a benchmark for inserting into node128-inner, I found a bug. If a caller deletes from a node128, the slot index is set to invalid, but the child pointer is still valid. Do that a few times, and every child pointer is valid, even if no slot index points to it. When the next inserter comes along, something surprising happens. This function:
/* Return an unused slot in node-128 */
static int
node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
{
int slotpos = 0;

Assert(!NODE_IS_LEAF(node));
while (node_inner_128_is_slot_used(node, slotpos))
slotpos++;

return slotpos;
}

...passes an integer to this function, whose parameter is a uint8:
/* Is the slot in the node used? */
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
return (node->children[slot] != NULL);
}

...so instead of growing the node unnecessarily or segfaulting, it enters an infinite loop doing this:
add eax, 1
movzx ecx, al
cmp QWORD PTR [rbx+264+rcx*8], 0
jne .L147

The fix is easy enough -- set the child pointer to null upon deletion,
Good catch!
but I'm somewhat astonished that the regression tests didn't hit this. I do still intend to replace this code with something faster, but before I do so the tests should probably exercise the deletion paths more. Since VACUUM
Indeed, there are some tests for deletion but all of them delete all
keys in the node so we end up deleting the node. I've added tests of
repeating deletion and insertion as well as additional assertions.
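The new tests are roughly of this shape (a sketch; rt_set/rt_delete/rt_search are the radix tree API names used elsewhere in the thread, and "keys"/"radixtree" are hypothetical test-module variables):

	/*
	 * Repeatedly delete and re-insert the same keys so that freed node128
	 * slots (and their child pointers) must be found and reused.
	 */
	for (int iter = 0; iter < 1000; iter++)
	{
		uint64		key = keys[iter % nkeys];
		uint64		val;

		rt_delete(radixtree, key);
		Assert(!rt_search(radixtree, key, &val));

		rt_set(radixtree, key, key);
		Assert(rt_search(radixtree, key, &val) && val == key);
	}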
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Nov 23, 2022 at 2:10 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-11-21 17:06:56 +0900, Masahiko Sawada wrote:
Sure. I've attached the v10 patches. 0004 is the pure refactoring
patch and the 0005 patch introduces the pointer tagging.

This failed on cfbot, with so many crashes that the VM ran out of disk for
core dumps. During testing with 32bit, so there's probably something broken
around that.

https://cirrus-ci.com/task/4635135954386944
A failure is e.g. at: https://api.cirrus-ci.com/v1/artifact/task/4635135954386944/testrun/build-32/testrun/adminpack/regress/log/initdb.log
performing post-bootstrap initialization ... ../src/backend/lib/radixtree.c:1696:21: runtime error: member access within misaligned address 0x590faf74 for type 'struct radix_tree_control', which requires 8 byte alignment
0x590faf74: note: pointer points here
90 11 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^
radix_tree_control struct has two pg_atomic_uint64 variables, and the
assertion check in pg_atomic_init_u64() failed:
static inline void
pg_atomic_init_u64(volatile pg_atomic_uint64 *ptr, uint64 val)
{
/*
* Can't necessarily enforce alignment - and don't need it - when using
* the spinlock based fallback implementation. Therefore only assert when
* not using it.
*/
#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
AssertPointerAlignment(ptr, 8);
#endif
pg_atomic_init_u64_impl(ptr, val);
}
I've investigated this issue and have a question about using atomic
variables on palloc'ed memory. In non-parallel vacuum cases,
radix_tree_control is allocated via aset.c. IIUC in 32-bit machines,
the memory allocated by aset.c is 4-bytes aligned so these atomic
variables are not always 8-bytes aligned. Is there any way to enforce
8-bytes aligned memory allocations in 32-bit machines?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Nov 30, 2022 at 11:09 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I've investigated this issue and have a question about using atomic
variables on palloc'ed memory. In non-parallel vacuum cases,
radix_tree_control is allocated via aset.c. IIUC in 32-bit machines,
the memory allocated by aset.c is 4-bytes aligned so these atomic
variables are not always 8-bytes aligned. Is there any way to enforce
8-bytes aligned memory allocations in 32-bit machines?
The bigger question in my mind is: Why is there an atomic variable in
backend-local memory?
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Nov 30, 2022 at 2:28 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Nov 25, 2022 at 5:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
These hacks were my work, but I think we can improve that by having two
versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized
nodes. For that to work, in "init-node" we'd need a branch to set fanout to
zero for node256. That should be fine -- it already has to branch for
memset'ing node128's indexes to 0xFF.
Since the node has fanout regardless of fixed-sized and
variable-sized
As currently coded, yes. But that's not strictly necessary, I think.
, only node256 is the special case where the fanout in
the node doesn't match the actual fanout of the node. I think if we
want to have two versions of NODE_HAS_FREE_SLOT, we can have one for
node256 and one for other classes. Thoughts? In your idea, for
NODE_HAS_FREE_SLOT for fixed-sized nodes, you meant like the
following?

#define FIXED_NODE_HAS_FREE_SLOT(node, class)
(node->base.n.count < rt_size_class_info[class].fanout)
Right, and the other one could be VAR_NODE_...
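In code, the two variants might look roughly like this (a sketch; the macro names and the exact node256 handling are still open at this point in the thread):

#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
	((node)->base.n.count < rt_size_class_info[class].fanout)

#define VAR_NODE_HAS_FREE_SLOT(node) \
	((node)->base.n.count < (node)->base.n.fanout)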
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Dec 1, 2022 at 4:00 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Wed, Nov 30, 2022 at 11:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've investigated this issue and have a question about using atomic
variables on palloc'ed memory. In non-parallel vacuum cases,
radix_tree_control is allocated via aset.c. IIUC in 32-bit machines,
the memory allocated by aset.c is 4-bytes aligned so these atomic
variables are not always 8-bytes aligned. Is there any way to enforce
8-bytes aligned memory allocations in 32-bit machines?

The bigger question in my mind is: Why is there an atomic variable in backend-local memory?
Because I use the same radix_tree and radix_tree_control structs for
non-parallel and parallel vacuum. Therefore, radix_tree_control is
allocated in DSM for parallel-vacuum cases or in backend-local memory
for non-parallel vacuum cases.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Dec 1, 2022 at 3:03 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Thu, Dec 1, 2022 at 4:00 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
The bigger question in my mind is: Why is there an atomic variable in
backend-local memory?
Because I use the same radix_tree and radix_tree_control structs for
non-parallel and parallel vacuum. Therefore, radix_tree_control is
allocated in DSM for parallel-vacuum cases or in backend-local memory
for non-parallel vacuum cases.
Ok, that could be yet another reason to compile local- and shared-memory
functionality separately, but now I'm wondering why there are atomic
variables at all, since there isn't yet any locking support.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Nov 30, 2022 at 2:51 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
There are a few things up in the air, so I'm coming back to this list to summarize and add a recent update:
On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
- See how much performance we actually gain from tagging the node kind.
Needs a benchmark that has enough branch mispredicts and L2/3 misses to show a benefit. Otherwise either neutral or worse in its current form, depending on compiler(?). Put off for later.
- Try additional size classes while keeping the node kinds to only four.
This is relatively simple and effective. If only one additional size class (total 5) is coded as a placeholder, I imagine it will be easier to rebase shared memory logic than using this technique everywhere possible.
- Optimize node128 insert.
I've attached a rough start at this. The basic idea is borrowed from our bitmapset nodes, so we can iterate over and operate on word-sized (32- or 64-bit) types at a time, rather than bytes.
Thanks! I think this is a good idea.
To make this easier, I've moved some of the lower-level macros and types from bitmapset.h/.c to pg_bitutils.h. That's probably going to need a separate email thread to resolve the coding style clash this causes, so that can be put off for later.
Agreed. Since tidbitmap.c also has WORDNUM(x) and BITNUM(x), it could
use them as well once we move them out of bitmapset.h.
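Putting the word-at-a-time idea together, the unused-slot search in the addendum (excerpted in the diff quoted earlier in the thread) boils down to something like this, using the WORDNUM, RIGHTMOST_ONE, and bmw_rightmost_one_pos helpers that the patch moves into pg_bitutils.h:

static int
node128_find_unused_slot(bitmapword *isset)
{
	int			idx;
	bitmapword	inverse;
	int			slotpos;

	/* skip over words whose bits are all set */
	for (idx = 0; idx < WORDNUM(128); idx++)
	{
		if (isset[idx] < ~((bitmapword) 0))
			break;
	}

	/* the first unset bit in X is the first set bit in ~X */
	inverse = ~(isset[idx]);
	slotpos = idx * BITS_PER_BITMAPWORD;
	slotpos += bmw_rightmost_one_pos(inverse);

	/* mark the slot used */
	isset[idx] |= RIGHTMOST_ONE(inverse);

	return slotpos;
}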
This is not meant to be included in the next patchset. For demonstration purposes, I get these results with a function that repeatedly deletes the last value from a mostly-full node128 leaf and re-inserts it:
select * from bench_node128_load(120);
v11
NOTICE: num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 208304 | 56

v11 + 0006 addendum
NOTICE: num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 208816 | 34

I didn't test inner nodes, but I imagine the difference is bigger. This bitmap style should also be used for the node256-leaf isset array simply to be consistent and avoid needing single-use macros, but that has not been done yet. It won't make a difference for performance because there is no iteration there.
After updating the patch set according to recent comments, I've also
done the same test in my environment and got similarly good results.
w/o 0006 addendum patch
NOTICE: num_keys = 14400, height = 1, n4 = 0, n15 = 0, n32 = 0, n125
= 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 204424 | 29
(1 row)
w/ 0006 addendum patch
NOTICE: num_keys = 14400, height = 1, n4 = 0, n15 = 0, n32 = 0, n125
= 121, n256 = 0
fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
120 | 14400 | 204936 | 18
(1 row)
- Try templating out the differences between local and shared memory.
I hope to start this sometime after the crashes on 32-bit are resolved.
I've attached updated patches that incorporate all the comments I got so
far, as well as fixes for compiler warnings. I included your bitmapword
patch as 0004 for benchmarking. Also, I reverted the change around
pg_atomic_u64: since we don't support any locking, as you mentioned, and
a single lwlock would protect the radix tree, we don't need to use
pg_atomic_u64 just for max_val and num_keys.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v12-0007-PoC-lazy-vacuum-integration.patch (application/octet-stream)
From e6bce249a60d60ce6ed5eeaf021b5993e7568415 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 4 Nov 2022 14:14:42 +0900
Subject: [PATCH v12 7/7] PoC: lazy vacuum integration.
The patch includes:
* Introducing a new module called TIDStore
* Lazy vacuum and parallel vacuum integration.
TODOs:
* radix tree needs to have the reset functionality.
* should not allow TIDStore to grow beyond the memory limit.
* change the progress statistics of pg_stat_progress_vacuum.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 448 ++++++++++++++++++++++++++
src/backend/access/heap/vacuumlazy.c | 164 +++-------
src/backend/commands/vacuum.c | 76 +----
src/backend/commands/vacuumparallel.c | 63 ++--
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 60 ++++
src/include/commands/vacuum.h | 24 +-
src/include/storage/lwlock.h | 1 +
10 files changed, 612 insertions(+), 228 deletions(-)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index 857beaa32d..76265974b1 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..c3cf771f7d
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,448 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * TID (ItemPointer) storage implementation.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+#include "miscadmin.h"
+
+#define XXX_DEBUG_TID_STORE 1
+
+/* XXX: should be configurable for non-heap AMs */
+#define TIDSTORE_OFFSET_NBITS 11 /* pg_ceil_log2_32(MaxHeapTuplesPerPage) */
+
+#define TIDSTORE_VALUE_NBITS 6 /* log(sizeof(uint64) * BITS_PER_BYTE, 2) */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+struct TIDStore
+{
+ /* main storage for TID */
+ radix_tree *tree;
+
+ /* # of tids in TIDStore */
+ int num_tids;
+
+ /* DSA area and handle for shared TIDStore */
+ rt_handle handle;
+ dsa_area *area;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ ItemPointer itemptrs;
+ uint64 nitems;
+#endif
+};
+
+static void tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+/*
+ * Comparator routines for use with qsort() and bsearch().
+ */
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+
+static void
+verify_iter_tids(TIDStoreIter *iter)
+{
+ uint64 index = iter->prev_index;
+
+ if (iter->ts->itemptrs == NULL)
+ return;
+
+ Assert(index <= iter->ts->nitems);
+
+ for (int i = 0; i < iter->num_offsets; i++)
+ {
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, iter->blkno);
+ ItemPointerSetOffsetNumber(&tid, iter->offsets[i]);
+
+ Assert(ItemPointerEquals(&iter->ts->itemptrs[index++], &tid));
+ }
+
+ iter->prev_index = iter->itemptrs_index;
+}
+
+static void
+dump_itemptrs(TIDStore *ts)
+{
+ StringInfoData buf;
+
+ if (ts->itemptrs == NULL)
+ return;
+
+ initStringInfo(&buf);
+ for (int i = 0; i < ts->nitems; i++)
+ {
+ appendStringInfo(&buf, "(%d,%d) ",
+ ItemPointerGetBlockNumber(&(ts->itemptrs[i])),
+ ItemPointerGetOffsetNumber(&(ts->itemptrs[i])));
+ }
+ elog(WARNING, "--- dump (" UINT64_FORMAT " items) ---", ts->nitems);
+ elog(WARNING, "%s\n", buf.data);
+}
+
+#endif
+
+/*
+ * Create a TIDStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TIDStore *
+tidstore_create(dsa_area *area)
+{
+ TIDStore *ts;
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+ ts->area = area;
+
+ if (area != NULL)
+ ts->handle = rt_get_handle(ts->tree);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+#define MAXDEADITEMS(avail_mem) \
+ (avail_mem / sizeof(ItemPointerData))
+
+ if (area == NULL)
+ {
+ ts->itemptrs = (ItemPointer) palloc0(sizeof(ItemPointerData) *
+ MAXDEADITEMS(maintenance_work_mem * 1024));
+ ts->nitems = 0;
+ }
+#endif
+
+ return ts;
+}
+
+/* Attach to the shared TIDStore using a handle */
+TIDStore *
+tidstore_attach(dsa_area *area, rt_handle handle)
+{
+ TIDStore *ts;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ ts = palloc0(sizeof(TIDStore));
+ ts->tree = rt_attach(area, handle);
+
+ return ts;
+}
+
+/*
+ * Detach from a TIDStore. This detaches from radix tree and frees the
+ * backend-local resources.
+ */
+void
+tidstore_detach(TIDStore *ts)
+{
+ rt_detach(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_free(TIDStore *ts)
+{
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ pfree(ts->itemptrs);
+#endif
+
+ rt_free(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_reset(TIDStore *ts)
+{
+ dsa_area *area = ts->area;
+
+ /* Reset the statistics */
+ ts->num_tids = 0;
+
+ /* Recreate radix tree storage */
+ rt_free(ts->tree);
+ ts->tree = rt_create(CurrentMemoryContext, area);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ ts->nitems = 0;
+#endif
+}
+
+/* Add TIDs to TIDStore */
+void
+tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ ts->num_tids++;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ {
+ ItemPointerSetBlockNumber(&(ts->itemptrs[ts->nitems]), blkno);
+ ItemPointerSetOffsetNumber(&(ts->itemptrs[ts->nitems]), offsets[i]);
+ ts->nitems++;
+ }
+#endif
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(ts->nitems == ts->num_tids);
+#endif
+}
+
+/* Return true if the given TID is present in TIDStore */
+bool
+tidstore_lookup_tid(TIDStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ bool found_assert;
+#endif
+
+ key = tid_to_key_off(tid, &off);
+
+ found = rt_search(ts->tree, key, &val);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ found_assert = bsearch((void *) tid,
+ (void *) ts->itemptrs,
+ ts->nitems,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr) != NULL;
+#endif
+
+ if (!found)
+ {
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(!found_assert);
+#endif
+ return false;
+ }
+
+ found = (val & (UINT64CONST(1) << off)) != 0;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+
+ if (ts->itemptrs && found != found_assert)
+ {
+ elog(WARNING, "tid (%d,%d)\n",
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
+ dump_itemptrs(ts);
+ }
+
+ if (ts->itemptrs)
+ Assert(found == found_assert);
+
+#endif
+ return found;
+}
+
+TIDStoreIter *
+tidstore_begin_iterate(TIDStore *ts)
+{
+ TIDStoreIter *iter;
+
+ iter = palloc0(sizeof(TIDStoreIter));
+ iter->ts = ts;
+ iter->tree_iter = rt_begin_iterate(ts->tree);
+ iter->blkno = InvalidBlockNumber;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index = 0;
+#endif
+
+ return iter;
+}
+
+bool
+tidstore_iterate_next(TIDStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+
+ if (iter->finished)
+ return false;
+
+ if (BlockNumberIsValid(iter->blkno))
+ {
+ iter->num_offsets = 0;
+ tidstore_iter_collect_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (rt_iterate_next(iter->tree_iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(iter->blkno) && iter->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+ return true;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_collect_tids(iter, key, val);
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+
+ iter->finished = true;
+ return true;
+}
+
+uint64
+tidstore_num_tids(TIDStore *ts)
+{
+ return ts->num_tids;
+}
+
+uint64
+tidstore_memory_usage(TIDStore *ts)
+{
+ return (uint64) sizeof(TIDStore) + rt_memory_usage(ts->tree);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TIDStore
+ */
+tidstore_handle
+tidstore_get_handle(TIDStore *ts)
+{
+ return rt_get_handle(ts->tree);
+}
+
+/* Extract TIDs from key-value pair */
+static void
+tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val)
+{
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ iter->offsets[iter->num_offsets++] = off;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index++;
+#endif
+ }
+
+ iter->blkno = KEY_GET_BLKNO(key);
+}
+
+/* Encode a TID to key and val */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d59711b7ec..75dead6c14 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -144,6 +145,8 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+ int max_bytes;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -194,7 +197,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TIDStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -265,8 +268,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -392,6 +396,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indname = NULL;
vacrel->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
vacrel->verbose = verbose;
+ vacrel->max_bytes = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
errcallback.callback = vacuum_error_callback;
errcallback.arg = vacrel;
errcallback.previous = error_context_stack;
@@ -853,7 +860,7 @@ lazy_scan_heap(LVRelState *vacrel)
next_unskippable_block,
next_failsafe_block = 0,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TIDStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
@@ -867,7 +874,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = vacrel->max_bytes; /* XXX: should use # of tids */
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -937,8 +944,8 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ /* XXX: should not allow tidstore to grow beyond max_bytes */
+ if (tidstore_memory_usage(vacrel->dead_items) > vacrel->max_bytes)
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1070,11 +1077,17 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TIDStoreIter *iter;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ pfree(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1111,7 +1124,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1264,7 +1277,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1863,25 +1876,16 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
Assert(!prunestate->all_visible);
Assert(prunestate->has_lpdead_items);
vacrel->lpdead_item_pages++;
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
}
/* Finally, add page-local counts to whole-VACUUM counts */
@@ -2088,8 +2092,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2098,17 +2101,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2157,7 +2153,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2186,7 +2182,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2213,8 +2209,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2259,7 +2255,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ /* tidstore_reset(vacrel->dead_items); */
}
/*
@@ -2331,7 +2327,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2368,10 +2364,10 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TIDStoreIter *iter;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2388,8 +2384,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while (tidstore_iterate_next(iter))
{
BlockNumber tblk;
Buffer buf;
@@ -2398,12 +2394,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = iter->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2427,14 +2424,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2451,11 +2447,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2474,16 +2469,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2563,7 +2553,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3065,46 +3054,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3115,12 +3064,6 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
-
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
* be used for an index, so we invoke parallelism only if there are at
@@ -3146,7 +3089,6 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3159,11 +3101,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index a6d5ed1f6b..62db8b0101 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2283,16 +2282,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TIDStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2323,18 +2322,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2345,60 +2332,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TIDStore *dead_items = (TIDStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26d796e52..742039b3a6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TIDStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,22 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +289,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +356,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +375,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +384,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +441,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_free(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +452,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TIDStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +950,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +996,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1045,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a5ad36ca78..2fb30fe2e7 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -183,6 +183,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..f4ccf1dbc5
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,60 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * TID storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TIDStore TIDStore;
+
+typedef struct TIDStoreIter
+{
+ TIDStore *ts;
+
+ rt_iter *tree_iter;
+
+ bool finished;
+
+ uint64 next_key;
+ uint64 next_val;
+
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually don't use up */
+ int num_offsets;
+
+#ifdef USE_ASSERT_CHECKING
+ uint64 itemptrs_index;
+ int prev_index;
+#endif
+} TIDStoreIter;
+
+extern TIDStore *tidstore_create(dsa_area *dsa);
+extern TIDStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TIDStore *ts);
+extern void tidstore_free(TIDStore *ts);
+extern void tidstore_reset(TIDStore *ts);
+extern void tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TIDStore *ts, ItemPointer tid);
+extern TIDStoreIter * tidstore_begin_iterate(TIDStore *ts);
+extern bool tidstore_iterate_next(TIDStoreIter *iter);
+extern uint64 tidstore_num_tids(TIDStore *ts);
+extern uint64 tidstore_memory_usage(TIDStore *ts);
+extern tidstore_handle tidstore_get_handle(TIDStore *ts);
+
+#endif /* TIDSTORE_H */
+
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4e4bc26a8b..c15e6d7a66 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -235,21 +236,6 @@ typedef struct VacuumParams
int nworkers;
} VacuumParams;
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -302,18 +288,16 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TIDStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TIDStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index a494cb598f..88e35254d1 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -201,6 +201,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
--
2.31.1
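To give a feel for how the new API is meant to be driven, here is a minimal, hypothetical caller sketch based on the declarations in tidstore.h above; it is not part of the patch. The per-block iteration contract (tidstore_iterate_next() filling the iterator's blkno, offsets and num_offsets for one block per call) is inferred from the iterator struct rather than spelled out in the header, so treat that as an assumption.

/*
 * Hypothetical caller sketch, not part of the patch: remember the dead item
 * offsets of one heap page, probe the store as lazy_tid_reaped() would, then
 * walk the stored TIDs block by block.
 */
#include "postgres.h"
#include "access/tidstore.h"
#include "utils/dsa.h"

static void
tidstore_usage_sketch(dsa_area *area, BlockNumber blkno,
					  OffsetNumber *dead_offsets, int ndead)
{
	TIDStore   *ts = tidstore_create(area);
	TIDStoreIter *iter;
	ItemPointerData tid;

	/* record all dead item offsets found on this heap page */
	tidstore_add_tids(ts, blkno, dead_offsets, ndead);

	/* existence check, one per index tuple during index vacuuming */
	ItemPointerSet(&tid, blkno, dead_offsets[0]);
	Assert(tidstore_lookup_tid(ts, &tid));

	/* iterate block by block, e.g. for the heap vacuum phase */
	iter = tidstore_begin_iterate(ts);
	while (tidstore_iterate_next(iter))
		elog(DEBUG1, "block %u has %d dead offsets",
			 iter->blkno, iter->num_offsets);

	elog(DEBUG1, "stored " UINT64_FORMAT " TIDs in " UINT64_FORMAT " bytes",
		 tidstore_num_tids(ts), tidstore_memory_usage(ts));

	tidstore_free(ts);
}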
Attachment: v12-0005-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch (application/octet-stream)
From f9bc757064a1dcbcfb98f9df2a497b510252c0d2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 14 Nov 2022 11:44:17 +0900
Subject: [PATCH v12 5/7] Use rt_node_ptr to reference radix tree nodes.
---
src/backend/lib/radixtree.c | 688 +++++++++++++++++++++---------------
1 file changed, 398 insertions(+), 290 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 673cc5e46b..a97d86ae2b 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -145,6 +145,19 @@ typedef enum rt_size_class
#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
} rt_size_class;
+/*
+ * rt_pointer is a pointer compatible with a pointer to local memory and a
+ * pointer for DSA area (i.e. dsa_pointer). Since the radix tree node can be
+ * allocated in backend local memory as well as DSA area, we cannot use a
+ * C-pointer to rt_node (i.e. backend local memory address) for child pointers
+ * in inner nodes. Inner nodes need to use rt_pointer instead. We can get
+ * the backend local memory address of a node from a rt_pointer by using
+ * rt_pointer_decode().
+*/
+typedef uintptr_t rt_pointer;
+#define InvalidRTPointer ((rt_pointer) 0)
+#define RTPointerIsValid(x) (((rt_pointer) (x)) != InvalidRTPointer)
+
/* Common type for all nodes types */
typedef struct rt_node
{
@@ -170,8 +183,7 @@ typedef struct rt_node
/* Node kind, one per search/set algorithm */
uint8 kind;
} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define RT_NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
@@ -235,7 +247,7 @@ typedef struct rt_node_inner_4
rt_node_base_4 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
@@ -251,7 +263,7 @@ typedef struct rt_node_inner_32
rt_node_base_32 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
@@ -267,7 +279,7 @@ typedef struct rt_node_inner_125
rt_node_base_125 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_125;
typedef struct rt_node_leaf_125
@@ -287,7 +299,7 @@ typedef struct rt_node_inner_256
rt_node_base_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
+ rt_pointer children[RT_NODE_MAX_SLOTS];
} rt_node_inner_256;
typedef struct rt_node_leaf_256
@@ -301,6 +313,29 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
+/* rt_node_ptr is a data structure representing a pointer for a rt_node */
+typedef struct rt_node_ptr
+{
+ rt_pointer encoded;
+ rt_node *decoded;
+} rt_node_ptr;
+#define InvalidRTNodePtr \
+ (rt_node_ptr) {.encoded = InvalidRTPointer, .decoded = NULL}
+#define RTNodePtrIsValid(n) \
+ (!rt_node_ptr_eq((rt_node_ptr *) &(n), &(InvalidRTNodePtr)))
+
+/* Macros for rt_node_ptr to access the fields of rt_node */
+#define NODE_RAW(n) (n.decoded)
+#define NODE_IS_LEAF(n) (NODE_RAW(n)->shift == 0)
+#define NODE_IS_EMPTY(n) (NODE_COUNT(n) == 0)
+#define NODE_KIND(n) (NODE_RAW(n)->kind)
+#define NODE_COUNT(n) (NODE_RAW(n)->count)
+#define NODE_SHIFT(n) (NODE_RAW(n)->shift)
+#define NODE_CHUNK(n) (NODE_RAW(n)->chunk)
+#define NODE_FANOUT(n) (NODE_RAW(n)->fanout)
+#define NODE_HAS_FREE_SLOT(n) \
+ (NODE_COUNT(n) < rt_node_kind_info[NODE_KIND(n)].fanout)
+
/* Information for each size class */
typedef struct rt_size_class_elem
{
@@ -389,7 +424,7 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
*/
typedef struct rt_node_iter
{
- rt_node *node; /* current node being iterated */
+ rt_node_ptr node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
} rt_node_iter;
@@ -410,7 +445,7 @@ struct radix_tree
{
MemoryContext context;
- rt_node *root;
+ rt_pointer root;
uint64 max_val;
uint64 num_keys;
@@ -424,27 +459,58 @@ struct radix_tree
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
-static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+
+static rt_node_ptr rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class,
bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_free_node(radix_tree *tree, rt_node_ptr node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+static inline bool rt_node_search_inner(rt_node_ptr node_ptr, uint64 key, rt_action action,
+ rt_pointer *child_p);
+static inline bool rt_node_search_leaf(rt_node_ptr node_ptr, uint64 key, rt_action action,
uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+static bool rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node_ptr *child_p);
static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static void rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from);
static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
+static void rt_verify_node(rt_node_ptr node);
+
+/* Decode and encode functions of rt_pointer */
+static inline rt_node *
+rt_pointer_decode(rt_pointer encoded)
+{
+ return (rt_node *) encoded;
+}
+
+static inline rt_pointer
+rt_pointer_encode(rt_node *decoded)
+{
+ return (rt_pointer) decoded;
+}
+
+/* Return a rt_node_ptr created from the given encoded pointer */
+static inline rt_node_ptr
+rt_node_ptr_encoded(rt_pointer encoded)
+{
+ return (rt_node_ptr) {
+ .encoded = encoded,
+ .decoded = rt_pointer_decode(encoded),
+ };
+}
+
+static inline bool
+rt_node_ptr_eq(rt_node_ptr *a, rt_node_ptr *b)
+{
+ return (a->decoded == b->decoded) && (a->encoded == b->encoded);
+}
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
@@ -593,10 +659,10 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_shift(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_pointer) * (count - idx));
}
static inline void
@@ -608,10 +674,10 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_delete(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_pointer) * (count - idx - 1));
}
static inline void
@@ -623,12 +689,12 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children)
+chunk_children_array_copy(uint8 *src_chunks, rt_pointer *src_children,
+ uint8 *dst_chunks, rt_pointer *dst_children)
{
const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size children_size = sizeof(rt_node *) * fanout;
+ const Size children_size = sizeof(rt_pointer) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_children, src_children, children_size);
@@ -660,7 +726,7 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
@@ -668,23 +734,23 @@ node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
static inline bool
node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
#endif
-static inline rt_node *
+static inline rt_pointer
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline uint64
node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -694,9 +760,9 @@ node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
- node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->children[node->base.slot_idxs[chunk]] = InvalidRTPointer;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -705,7 +771,7 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -737,11 +803,11 @@ node_125_find_unused_slot(bitmapword *isset)
}
static inline void
-node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
int slotpos;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -756,7 +822,7 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -767,16 +833,16 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
/* Update the child corresponding to 'chunk' to 'child' */
static inline void
-node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[node->base.slot_idxs[chunk]] = child;
}
static inline void
node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->values[node->base.slot_idxs[chunk]] = value;
}
@@ -786,21 +852,21 @@ node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
static inline bool
node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[chunk]);
}
static inline bool
node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
}
-static inline rt_node *
+static inline rt_pointer
node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(node_inner_256_is_chunk_used(node, chunk));
return node->children[chunk];
}
@@ -808,16 +874,16 @@ node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
static inline uint64
node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(node_leaf_256_is_chunk_used(node, chunk));
return node->values[chunk];
}
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -825,7 +891,7 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
static inline void
node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
node->values[chunk] = value;
}
@@ -834,14 +900,14 @@ node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
static inline void
node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = InvalidRTPointer;
}
static inline void
node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
}
@@ -877,29 +943,32 @@ rt_new_root(radix_tree *tree, uint64 key)
{
int shift = key_get_shift(key);
bool inner = shift > 0;
- rt_node *newnode;
+ rt_node_ptr newnode;
newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newnode->shift = shift;
+ NODE_SHIFT(newnode) = shift;
+
tree->max_val = shift_get_max_val(shift);
- tree->root = newnode;
+ tree->root = newnode.encoded;
}
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
+static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
else
- newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
#ifdef RT_DEBUG
/* update the statistics */
@@ -911,20 +980,20 @@ rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
/* Initialize the node contents */
static inline void
-rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class, bool inner)
{
if (inner)
- MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].inner_size);
else
- MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].leaf_size);
- node->kind = kind;
- node->fanout = rt_size_class_info[size_class].fanout;
+ NODE_KIND(node) = kind;
+ NODE_FANOUT(node) = rt_size_class_info[size_class].fanout;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_125)
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
}
@@ -934,25 +1003,25 @@ rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
* and this is the max size class to it will never grow.
*/
if (kind == RT_NODE_KIND_256)
- node->fanout = 0;
+ NODE_FANOUT(node) = 0;
}
static inline void
-rt_copy_node(rt_node *newnode, rt_node *oldnode)
+rt_copy_node(rt_node_ptr newnode, rt_node_ptr oldnode)
{
- newnode->shift = oldnode->shift;
- newnode->chunk = oldnode->chunk;
- newnode->count = oldnode->count;
+ NODE_SHIFT(newnode) = NODE_SHIFT(oldnode);
+ NODE_CHUNK(newnode) = NODE_CHUNK(oldnode);
+ NODE_COUNT(newnode) = NODE_COUNT(oldnode);
}
/*
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node*
-rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+static rt_node_ptr
+rt_grow_node_kind(radix_tree *tree, rt_node_ptr node, uint8 new_kind)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
bool inner = !NODE_IS_LEAF(node);
newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
@@ -964,12 +1033,12 @@ rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
+ if (tree->root == node.encoded)
{
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
}
@@ -980,7 +1049,7 @@ rt_free_node(radix_tree *tree, rt_node *node)
/* update the statistics */
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
- if (node->fanout == rt_size_class_info[i].fanout)
+ if (NODE_FANOUT(node) == rt_size_class_info[i].fanout)
break;
}
@@ -993,29 +1062,30 @@ rt_free_node(radix_tree *tree, rt_node *node)
}
#endif
- pfree(node);
+ pfree(node.decoded);
}
/*
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
+ rt_node_ptr new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ Assert(NODE_CHUNK(old_child) == NODE_CHUNK(new_child));
+ Assert(NODE_SHIFT(old_child) == NODE_SHIFT(new_child));
- if (parent == old_child)
+ if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->root = new_child.encoded;
}
else
{
bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ replaced = rt_node_insert_inner(tree, InvalidRTNodePtr, parent, key,
+ new_child);
Assert(replaced);
}
@@ -1030,24 +1100,28 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ rt_node *root = rt_pointer_decode(tree->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- rt_node_inner_4 *node;
+ rt_node_ptr node;
+ rt_node_inner_4 *n4;
+
+ node = rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
- rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node->base.n.shift = shift;
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ n4 = (rt_node_inner_4 *) node.decoded;
+ n4->base.n.shift = shift;
+ n4->base.n.count = 1;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ root->chunk = 0;
+ tree->root = node.encoded;
shift += RT_NODE_SPAN;
}
@@ -1060,21 +1134,22 @@ rt_extend(radix_tree *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
+ rt_node_ptr node)
{
- int shift = node->shift;
+ int shift = NODE_SHIFT(node);
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ rt_node_ptr newchild;
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newchild->shift = newshift;
- newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ NODE_SHIFT(newchild) = newshift;
+ NODE_CHUNK(newchild) = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
+
rt_node_insert_inner(tree, parent, node, key, newchild);
parent = node;
@@ -1094,17 +1169,18 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
+ rt_pointer *child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_node *child = NULL;
+ rt_pointer child;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1122,7 +1198,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1138,7 +1214,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1154,7 +1230,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, chunk))
break;
@@ -1171,7 +1247,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && child_p)
*child_p = child;
@@ -1187,17 +1263,17 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node_ptr node, uint64 key, rt_action action, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
uint64 value = 0;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1215,7 +1291,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1231,7 +1307,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1247,7 +1323,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, chunk))
break;
@@ -1264,7 +1340,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && value_p)
*value_p = value;
@@ -1274,19 +1350,19 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(!NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1294,25 +1370,27 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n4->children[idx] = child;
+ n4->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) new.decoded;
+
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1325,14 +1403,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
+ n4->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1340,45 +1418,52 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n32->children[idx] = child;
+ n32->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new32 = (rt_node_inner_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_inner_32;
}
else
{
+ rt_node_ptr new;
rt_node_inner_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_inner_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
@@ -1393,7 +1478,7 @@ retry_insert_inner_32:
count, insertpos);
n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
+ n32->children[insertpos] = child.encoded;
break;
}
}
@@ -1401,25 +1486,28 @@ retry_insert_inner_32:
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
{
/* found the existing chunk */
chunk_exists = true;
- node_inner_125_update(n125, chunk, child);
+ node_inner_125_update(n125, chunk, child.encoded);
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_inner_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1429,32 +1517,31 @@ retry_insert_inner_32:
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
- node_inner_125_insert(n125, chunk, child);
+ node_inner_125_insert(n125, chunk, child.encoded);
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
- node_inner_256_set(n256, chunk, child);
+ node_inner_256_set(n256, chunk, child.encoded);
break;
}
}
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1467,19 +1554,19 @@ retry_insert_inner_32:
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1493,16 +1580,18 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) new.decoded;
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
- node = (rt_node *) new32;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1522,7 +1611,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1536,45 +1625,51 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new32 = (rt_node_leaf_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_leaf_32;
}
else
{
+ rt_node_ptr new;
rt_node_leaf_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_leaf_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
- key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
{
- retry_insert_leaf_32:
+retry_insert_leaf_32:
{
int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
int count = n32->base.n.count;
@@ -1592,7 +1687,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
@@ -1605,12 +1700,14 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_leaf_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_leaf_256 *) new.decoded;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1620,9 +1717,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1633,7 +1729,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
@@ -1645,7 +1741,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1669,7 +1765,7 @@ rt_create(MemoryContext ctx)
tree = palloc(sizeof(radix_tree));
tree->context = ctx;
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
tree->num_keys = 0;
@@ -1718,26 +1814,23 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- rt_node *node;
- rt_node *parent;
+ rt_node_ptr node;
+ rt_node_ptr parent;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
if (key > tree->max_val)
rt_extend(tree, key);
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = parent = tree->root;
-
/* Descend the tree until a leaf node */
+ node = parent = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1749,7 +1842,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1770,21 +1863,21 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
bool
rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RTPointerIsValid(tree->root) || key > tree->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1792,7 +1885,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1806,8 +1899,8 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
bool
rt_delete(radix_tree *tree, uint64 key)
{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ rt_node_ptr node;
+ rt_node_ptr stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1819,12 +1912,12 @@ rt_delete(radix_tree *tree, uint64 key)
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
{
- rt_node *child;
+ rt_pointer child;
/* Push the current node to the stack */
stack[++level] = node;
@@ -1832,7 +1925,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1883,6 +1976,7 @@ rt_iter *
rt_begin_iterate(radix_tree *tree)
{
MemoryContext old_ctx;
+ rt_node_ptr root;
rt_iter *iter;
int top_level;
@@ -1892,17 +1986,18 @@ rt_begin_iterate(radix_tree *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree->root)
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = rt_node_ptr_encoded(iter->tree->root);
+ top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ rt_update_iter_stack(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1913,14 +2008,15 @@ rt_begin_iterate(radix_tree *tree)
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
{
int level = from;
- rt_node *node = from_node;
+ rt_node_ptr node = from_node;
for (;;)
{
rt_node_iter *node_iter = &(iter->stack[level--]);
+ bool found PG_USED_FOR_ASSERTS_ONLY;
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1930,10 +2026,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
return;
/* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
+ found = rt_node_inner_iterate_next(iter, node_iter, &node);
/* We must find the first children in the node */
- Assert(node);
+ Assert(found);
}
}
@@ -1950,7 +2046,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- rt_node *child = NULL;
+ rt_node_ptr child = InvalidRTNodePtr;
uint64 value;
int level;
bool found;
@@ -1971,14 +2067,12 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
*/
for (level = 1; level <= iter->stack_len; level++)
{
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
-
- if (child)
+ if (rt_node_inner_iterate_next(iter, &(iter->stack[level]), &child))
break;
}
/* the iteration finished */
- if (!child)
+ if (!RTNodePtrIsValid(child))
return false;
/*
@@ -2010,18 +2104,19 @@ rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
* Advance the slot in the inner node. Return the child if exists, otherwise
* null.
*/
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+static inline bool
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *child_p)
{
- rt_node *child = NULL;
+ rt_node_ptr node = node_iter->node;
+ rt_pointer child;
bool found = false;
uint8 key_chunk;
- switch (node_iter->node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2034,7 +2129,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2047,7 +2142,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2067,7 +2162,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2088,9 +2183,12 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
if (found)
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ {
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
+ *child_p = rt_node_ptr_encoded(child);
+ }
- return child;
+ return found;
}
/*
@@ -2098,19 +2196,18 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
* is set to value_p, otherwise return false.
*/
static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_p)
{
- rt_node *node = node_iter->node;
+ rt_node_ptr node = node_iter->node;
bool found = false;
uint64 value;
uint8 key_chunk;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2123,7 +2220,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2136,7 +2233,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2156,7 +2253,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2178,7 +2275,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
if (found)
{
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
*value_p = value;
}
@@ -2215,16 +2312,16 @@ rt_memory_usage(radix_tree *tree)
* Verify the radix tree node.
*/
static void
-rt_verify_node(rt_node *node)
+rt_verify_node(rt_node_ptr node)
{
#ifdef USE_ASSERT_CHECKING
- Assert(node->count >= 0);
+ Assert(NODE_COUNT(node) >= 0);
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node.decoded;
for (int i = 1; i < n4->n.count; i++)
Assert(n4->chunks[i - 1] < n4->chunks[i]);
@@ -2233,7 +2330,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_32:
{
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node.decoded;
for (int i = 1; i < n32->n.count; i++)
Assert(n32->chunks[i - 1] < n32->chunks[i]);
@@ -2242,7 +2339,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2252,10 +2349,10 @@ rt_verify_node(rt_node *node)
/* Check if the corresponding slot is used */
if (NODE_IS_LEAF(node))
- Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) n125,
n125->slot_idxs[i]));
else
- Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) n125,
n125->slot_idxs[i]));
cnt++;
@@ -2268,7 +2365,7 @@ rt_verify_node(rt_node *node)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
@@ -2289,54 +2386,62 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
+ rt_node *root = rt_pointer_decode(tree->root);
+
+ if (root == NULL)
+ return;
+
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->num_keys,
+ root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(rt_node_ptr node, int level, bool recurse)
{
- char space[125] = {0};
+ rt_node *n = node.decoded;
+ char space[128] = {0};
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_125) ? 125 : 256,
- node->fanout == 0 ? 256 : node->fanout,
- node->count, node->shift, node->chunk);
+
+ (n->kind == RT_NODE_KIND_4) ? 4 :
+ (n->kind == RT_NODE_KIND_32) ? 32 :
+ (n->kind == RT_NODE_KIND_125) ? 125 : 256,
+ n->fanout == 0 ? 256 : n->fanout,
+ n->count, n->shift, n->chunk);
if (level > 0)
sprintf(space, "%*c", level * 4, ' ');
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n4->base.chunks[i], n4->values[i]);
}
else
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2345,25 +2450,26 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_32:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_KIND(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n32->base.chunks[i], n32->values[i]);
}
else
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n32->base.chunks[i]);
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2373,7 +2479,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node.decoded;
fprintf(stderr, "slot_idxs ");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2385,7 +2491,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node.decoded;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < WORDNUM(128); i++)
@@ -2415,7 +2521,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_125_get_child(n125, i),
+ rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2429,7 +2535,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, i))
continue;
@@ -2439,7 +2545,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
else
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, i))
continue;
@@ -2448,8 +2554,8 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
+ rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2462,7 +2568,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
void
rt_dump_search(radix_tree *tree, uint64 key)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
int level = 0;
@@ -2470,7 +2576,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
tree->max_val, tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
elog(NOTICE, "tree is empty");
return;
@@ -2483,11 +2589,11 @@ rt_dump_search(radix_tree *tree, uint64 key)
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
rt_dump_node(node, level, false);
@@ -2504,7 +2610,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2513,6 +2619,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
+ rt_node_ptr root;
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
@@ -2523,12 +2630,13 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ root = rt_node_ptr_encoded(tree->root);
+ rt_dump_node(root, 0, true);
}
#endif
--
2.31.1
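
A side note on the refactoring in the patch above: rt_node_ptr simply bundles a node's stable "encoded" pointer (the form stored inside parent nodes) with the "decoded" address that the current process dereferences. The following standalone sketch uses made-up example_* names to illustrate the idea; in the local-memory case decoding is just a cast, and the DSA patch later in this series swaps in dsa_get_address() for the shared case.

/*
 * Illustrative sketch only (names are made up, not from the patch): the idea
 * behind rt_node_ptr is to carry both representations of a node pointer.
 */
#include <stdint.h>
#include <stdio.h>

typedef uintptr_t example_pointer;      /* stands in for rt_pointer */

typedef struct example_node
{
    int         count;                  /* stands in for an rt_node field */
} example_node;

typedef struct example_node_ptr
{
    example_pointer encoded;            /* form stored inside parent nodes */
    example_node *decoded;              /* address this process dereferences */
} example_node_ptr;

static example_node_ptr
example_ptr_from_encoded(example_pointer encoded)
{
    /* local-memory case: decoding is just a cast */
    return (example_node_ptr) {.encoded = encoded,
                               .decoded = (example_node *) encoded};
}

int
main(void)
{
    example_node n = {.count = 3};
    example_node_ptr p = example_ptr_from_encoded((example_pointer) &n);

    printf("count = %d\n", p.decoded->count);
    return 0;
}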
Attachment: v12-0003-tool-for-measuring-radix-tree-performance.patch (application/octet-stream)
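
The benchmark below packs each TID into a 64-bit radix tree key plus a bit position within the uint64 value stored for that key (see tid_to_key_off() in the patch). A rough standalone illustration of that packing, assuming 8kB pages where pg_ceil_log2_32(MaxHeapTuplesPerPage) is 9; the block and offset numbers are arbitrary:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t    block = 1000;           /* hypothetical block number */
    uint64_t    offset = 17;            /* hypothetical offset number */
    uint64_t    tid_i;
    uint64_t    key;
    uint32_t    bit;

    /* offset in the low 9 bits, block number above it */
    tid_i = offset | (block << 9);

    /* low 6 bits select a bit within the uint64 value stored for 'key' */
    bit = tid_i & ((1 << 6) - 1);
    key = tid_i >> 6;

    /* prints: key = 8000, bit = 17 */
    printf("key = %llu, bit = %u\n", (unsigned long long) key, bit);
    return 0;
}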
From 1e244ff8963101b8a74fb3db01fae19f15d620a3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v12 3/7] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 +++
contrib/bench_radix_tree/bench_radix_tree.c | 635 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 767 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..83529805fc
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..a0693695e6
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,635 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
Attachment: v12-0006-PoC-DSA-support-for-radix-tree.patch (application/octet-stream)
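
Before the patch itself, here is a sketch (not part of the patch) of how the shared mode is meant to be used, based on the API the patch adds. For brevity it runs both the creator and the attacher in one backend, as the test module does; a real attacher would attach to the DSA area first (dsa_attach()) and receive the rt_handle through shared memory.

#include "postgres.h"

#include "lib/radixtree.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"

/* illustrative only; error handling and locking omitted */
static void
shared_radix_tree_example(void)
{
    dsa_area   *area;
    radix_tree *rt;
    radix_tree *rt2;
    rt_handle   handle;
    uint64      val;

    /* creator: build the tree in a DSA area and publish its handle */
    area = dsa_create(LWLockNewTrancheId());
    rt = rt_create(CurrentMemoryContext, area);
    rt_set(rt, 42, 100);
    handle = rt_get_handle(rt);

    /* attacher: with the same area and handle, look keys up directly */
    rt2 = rt_attach(area, handle);
    if (rt_search(rt2, 42, &val))
        Assert(val == 100);
    rt_detach(rt2);

    /* the creator eventually frees the tree and all nodes in the area */
    rt_free(rt);
}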
From 07daf71cbc20e445c6897e4e7790c85c5d59637d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 27 Oct 2022 14:02:00 +0900
Subject: [PATCH v12 6/7] PoC: DSA support for radix tree.
---
.../bench_radix_tree--1.0.sql | 2 +
contrib/bench_radix_tree/bench_radix_tree.c | 16 +-
src/backend/lib/radixtree.c | 437 ++++++++++++++----
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 8 +-
src/include/utils/dsa.h | 1 +
.../expected/test_radixtree.out | 25 +
.../modules/test_radixtree/test_radixtree.c | 147 ++++--
8 files changed, 502 insertions(+), 146 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 83529805fc..d9216d715c 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -7,6 +7,7 @@ create function bench_shuffle_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
@@ -23,6 +24,7 @@ create function bench_seq_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index a0693695e6..1a26722495 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -154,6 +154,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
BlockNumber maxblk = PG_GETARG_INT32(1);
bool random_block = PG_GETARG_BOOL(2);
radix_tree *rt = NULL;
+ bool shared = PG_GETARG_BOOL(3);
+ dsa_area *dsa = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;
@@ -176,7 +178,11 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
/* measure the load time of the radix tree */
- rt = rt_create(CurrentMemoryContext);
+ if (shared)
+ dsa = dsa_create(LWLockNewTrancheId());
+ rt = rt_create(CurrentMemoryContext, dsa);
+
+ /* measure the load time of the radix tree */
start_time = GetCurrentTimestamp();
for (int i = 0; i < ntids; i++)
{
@@ -327,7 +333,7 @@ bench_load_random_int(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
pg_prng_seed(&state, 0);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
@@ -393,7 +399,7 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
}
elog(NOTICE, "bench with filter 0x%lX", filter);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
for (uint64 i = 0; i < cnt; i++)
{
@@ -462,7 +468,7 @@ bench_fixed_height_search(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
@@ -574,7 +580,7 @@ bench_node128_load(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
key_id = 0;
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index a97d86ae2b..58e947f9df 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -22,6 +22,15 @@
* choose it to avoid an additional pointer traversal. It is the reason this code
* currently does not support variable-length keys.
*
+ * If a DSA area is specified for rt_create(), the radix tree is created in the
+ * DSA area so that multiple processes can access it simultaneously. The process
+ * that created the shared radix tree needs to pass both the DSA area specified
+ * when calling rt_create() and the dsa_pointer of the radix tree, obtained via
+ * rt_get_handle(), to other processes so that they can attach via rt_attach().
+ *
+ * XXX: the shared radix tree is still at the PoC stage as it doesn't have any
+ * locking support. Also, only one process at a time can iterate over it.
+ *
* XXX: Most functions in this file have two variants for inner nodes and leaf
* nodes, therefore there are duplication codes. While this sometimes makes the
* code maintenance tricky, this reduces branch prediction misses when judging
@@ -34,6 +43,9 @@
*
* rt_create - Create a new, empty radix tree
* rt_free - Free the radix tree
+ * rt_attach - Attach to the radix tree
+ * rt_detach - Detach from the radix tree
+ * rt_get_handle - Return the handle of the radix tree
* rt_search - Search a key-value pair
* rt_set - Set a key-value pair
* rt_delete - Delete a key-value pair
@@ -64,6 +76,7 @@
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
#ifdef RT_DEBUG
@@ -421,6 +434,10 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: We need either a safeguard that prevents other processes from starting
+ * an iteration while one process is iterating, or support for multiple
+ * processes iterating concurrently.
*/
typedef struct rt_node_iter
{
@@ -440,23 +457,43 @@ struct rt_iter
uint64 key;
};
-/* A radix tree with nodes */
-struct radix_tree
+/* A magic value used to identify our radix tree */
+#define RADIXTREE_MAGIC 0x54A48167
+
+/* Control information for a radix tree */
+typedef struct radix_tree_control
{
- MemoryContext context;
+ rt_handle handle;
+ uint32 magic;
+ /* Root node */
rt_pointer root;
+
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
- MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
-
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
+} radix_tree_control;
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ /* control object in either backend-local memory or DSA */
+ radix_tree_control *ctl;
+
+ /* used only when the radix tree is shared */
+ dsa_area *area;
+
+ /* used only when the radix tree is private */
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
};
+#define RadixTreeIsShared(rt) ((rt)->area != NULL)
static void rt_new_root(radix_tree *tree, uint64 key);
@@ -485,9 +522,12 @@ static void rt_verify_node(rt_node_ptr node);
/* Decode and encode functions of rt_pointer */
static inline rt_node *
-rt_pointer_decode(rt_pointer encoded)
+rt_pointer_decode(radix_tree *tree, rt_pointer encoded)
{
- return (rt_node *) encoded;
+ if (RadixTreeIsShared(tree))
+ return (rt_node *) dsa_get_address(tree->area, encoded);
+ else
+ return (rt_node *) encoded;
}
static inline rt_pointer
@@ -498,11 +538,11 @@ rt_pointer_encode(rt_node *decoded)
/* Return a rt_node_ptr created from the given encoded pointer */
static inline rt_node_ptr
-rt_node_ptr_encoded(rt_pointer encoded)
+rt_node_ptr_encoded(radix_tree *tree, rt_pointer encoded)
{
return (rt_node_ptr) {
.encoded = encoded,
- .decoded = rt_pointer_decode(encoded),
+ .decoded = rt_pointer_decode(tree, encoded)
};
}
@@ -949,8 +989,8 @@ rt_new_root(radix_tree *tree, uint64 key)
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
NODE_SHIFT(newnode) = shift;
- tree->max_val = shift_get_max_val(shift);
- tree->root = newnode.encoded;
+ tree->ctl->max_val = shift_get_max_val(shift);
+ tree->ctl->root = newnode.encoded;
}
/*
@@ -959,20 +999,35 @@ rt_new_root(radix_tree *tree, uint64 key)
static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node_ptr newnode;
+ rt_node_ptr newnode;
- if (inner)
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
- else
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ if (tree->area != NULL)
+ {
+ dsa_pointer dp;
- newnode.encoded = rt_pointer_encode(newnode.decoded);
+ if (inner)
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].inner_size);
+ else
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = (rt_pointer) dp;
+ newnode.decoded = rt_pointer_decode(tree, newnode.encoded);
+ }
+ else
+ {
+ if (inner)
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
+ }
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[size_class]++;
+ tree->ctl->cnt[size_class]++;
#endif
return newnode;
@@ -1036,10 +1091,10 @@ static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node.encoded)
+ if (tree->ctl->root == node.encoded)
{
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
+ tree->ctl->root = InvalidRTPointer;
+ tree->ctl->max_val = 0;
}
#ifdef RT_DEBUG
@@ -1057,12 +1112,15 @@ rt_free_node(radix_tree *tree, rt_node_ptr node)
if (i == RT_SIZE_CLASS_COUNT)
i = RT_CLASS_256;
- tree->cnt[i]--;
- Assert(tree->cnt[i] >= 0);
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
}
#endif
- pfree(node.decoded);
+ if (RadixTreeIsShared(tree))
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ else
+ pfree(node.decoded);
}
/*
@@ -1078,7 +1136,7 @@ rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child.encoded;
+ tree->ctl->root = new_child.encoded;
}
else
{
@@ -1100,7 +1158,7 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
@@ -1118,15 +1176,15 @@ rt_extend(radix_tree *tree, uint64 key)
n4->base.n.shift = shift;
n4->base.n.count = 1;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
root->chunk = 0;
- tree->root = node.encoded;
+ tree->ctl->root = node.encoded;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ tree->ctl->max_val = shift_get_max_val(target_shift);
}
/*
@@ -1158,7 +1216,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
}
rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
+ tree->ctl->num_keys++;
}
/*
@@ -1169,12 +1227,11 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
- rt_pointer *child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action, rt_pointer *child_p)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_pointer child;
+ rt_pointer child = InvalidRTPointer;
switch (NODE_KIND(node))
{
@@ -1205,6 +1262,7 @@ rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
break;
found = true;
+
if (action == RT_ACTION_FIND)
child = n32->children[idx];
else /* RT_ACTION_DELETE */
@@ -1756,33 +1814,51 @@ retry_insert_leaf_32:
* Create the radix tree in the given memory context and return it.
*/
radix_tree *
-rt_create(MemoryContext ctx)
+rt_create(MemoryContext ctx, dsa_area *area)
{
radix_tree *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->context = ctx;
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+
+ tree->area = area;
+ dp = dsa_allocate0(area, sizeof(radix_tree_control));
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
+ tree->ctl->handle = (rt_handle) dp;
+ }
+ else
+ {
+ tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
+ tree->ctl->handle = InvalidDsaPointer;
+ }
+
+ tree->ctl->magic = RADIXTREE_MAGIC;
+ tree->ctl->root = InvalidRTPointer;
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ if (area == NULL)
{
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
- rt_size_class_info[i].leaf_size);
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
- tree->cnt[i] = 0;
+ tree->ctl->cnt[i] = 0;
#endif
+ }
}
MemoryContextSwitchTo(old_ctx);
@@ -1790,16 +1866,163 @@ rt_create(MemoryContext ctx)
return tree;
}
+/*
+ * Get a handle that can be used by other processes to attach to this radix
+ * tree.
+ */
+dsa_pointer
+rt_get_handle(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree->ctl->handle;
+}
+
+/*
+ * Attach to an existing radix tree using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+radix_tree *
+rt_attach(dsa_area *area, rt_handle handle)
+{
+ radix_tree *tree;
+ dsa_pointer control;
+
+ /* Allocate the backend-local object representing the radix tree */
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the local radix tree */
+ tree->area = area;
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, control);
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree;
+}
+
+/*
+ * Detach from a radix tree. This frees backend-local resources associated
+ * with the radix tree, but the radix tree will continue to exist until
+ * it is explicitly freed.
+ */
+void
+rt_detach(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ pfree(tree);
+}
+
+/*
+ * Recursively free all nodes allocated in the DSA area.
+ */
+static void
+rt_free_recurse(radix_tree *tree, rt_pointer ptr)
+{
+ rt_node_ptr node = rt_node_ptr_encoded(tree, ptr);
+
+ Assert(RadixTreeIsShared(tree));
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers, so free it */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ return;
+ }
+
+ switch (NODE_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_125_get_child(n125, i));
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_256_get_child(n256, i));
+ }
+ break;
+ }
+ }
+
+ /* Free the inner node itself */
+ dsa_free(tree->area, node.encoded);
+}
+
/*
* Free the given radix tree.
*/
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
{
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
+ /* Free all memory used for radix tree nodes */
+ if (RTPointerIsValid(tree->ctl->root))
+ rt_free_recurse(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->area, tree->ctl->handle);
+ }
+ else
+ {
+ /* Free all memory used for radix tree nodes */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+ pfree(tree->ctl);
}
pfree(tree);
@@ -1817,16 +2040,18 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
rt_node_ptr node;
rt_node_ptr parent;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree, create the root */
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
rt_extend(tree, key);
/* Descend the tree until a leaf node */
- node = parent = rt_node_ptr_encoded(tree->root);
+ node = parent = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
@@ -1842,7 +2067,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1850,7 +2075,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ tree->ctl->num_keys++;
return updated;
}
@@ -1866,12 +2091,13 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
rt_node_ptr node;
int shift;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
Assert(value_p != NULL);
- if (!RTPointerIsValid(tree->root) || key > tree->max_val)
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
@@ -1885,7 +2111,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1905,14 +2131,16 @@ rt_delete(radix_tree *tree, uint64 key)
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
@@ -1925,7 +2153,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1940,7 +2168,7 @@ rt_delete(radix_tree *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ tree->ctl->num_keys--;
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1980,16 +2208,18 @@ rt_begin_iterate(radix_tree *tree)
rt_iter *iter;
int top_level;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
old_ctx = MemoryContextSwitchTo(tree->context);
iter = (rt_iter *) palloc0(sizeof(rt_iter));
iter->tree = tree;
/* empty tree */
- if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->ctl->root))
return iter;
- root = rt_node_ptr_encoded(iter->tree->root);
+ root = rt_node_ptr_encoded(tree, iter->tree->ctl->root);
top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
@@ -2040,8 +2270,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
bool
rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
{
+ Assert(!RadixTreeIsShared(iter->tree) || iter->tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return false;
for (;;)
@@ -2185,7 +2417,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *
if (found)
{
rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
- *child_p = rt_node_ptr_encoded(child);
+ *child_p = rt_node_ptr_encoded(iter->tree, child);
}
return found;
@@ -2288,7 +2520,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_
uint64
rt_num_entries(radix_tree *tree)
{
- return tree->num_keys;
+ return tree->ctl->num_keys;
}
/*
@@ -2297,12 +2529,19 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
+ total = dsa_get_total_size(tree->area);
+ else
{
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
}
return total;
@@ -2386,23 +2625,23 @@ rt_verify_node(rt_node_ptr node)
void
rt_stats(radix_tree *tree)
{
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
if (root == NULL)
return;
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
+ tree->ctl->num_keys,
root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node_ptr node, int level, bool recurse)
+rt_dump_node(radix_tree *tree, rt_node_ptr node, int level, bool recurse)
{
rt_node *n = node.decoded;
char space[128] = {0};
@@ -2440,7 +2679,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n4->children[i]),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2468,7 +2707,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n32->children[i]),
level + 1, recurse);
}
else
@@ -2521,7 +2760,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2554,7 +2795,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_256_get_child(n256, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2574,28 +2817,28 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
- tree->max_val, tree->max_val);
+ tree->ctl->max_val, tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
{
elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
rt_pointer child;
- rt_dump_node(node, level, false);
+ rt_dump_node(tree, node, level, false);
if (NODE_IS_LEAF(node))
{
@@ -2610,7 +2853,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2628,15 +2871,15 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].inner_blocksize,
rt_size_class_info[i].leaf_size,
rt_size_class_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- root = rt_node_ptr_encoded(tree->root);
- rt_dump_node(root, 0, true);
+ root = rt_node_ptr_encoded(tree, tree->ctl->root);
+ rt_dump_node(tree, root, 0, true);
}
#endif
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 82376fde2d..ad169882af 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..68a11df970 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -14,18 +14,24 @@
#define RADIXTREE_H
#include "postgres.h"
+#include "utils/dsa.h"
#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
+typedef dsa_pointer rt_handle;
-extern radix_tree *rt_create(MemoryContext ctx);
+extern radix_tree *rt_create(MemoryContext ctx, dsa_area *dsa);
extern void rt_free(radix_tree *tree);
extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern rt_handle rt_get_handle(radix_tree *tree);
+extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+extern void rt_detach(radix_tree *tree);
+
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
extern bool rt_delete(radix_tree *tree, uint64 key);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 405606fe2f..dad06adecc 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index ce645cb8b5..a217e0d312 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -6,28 +6,53 @@ CREATE EXTENSION test_radixtree;
SELECT test_radixtree();
NOTICE: testing basic operations with leaf node 4
NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 125
NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
NOTICE: testing basic operations with leaf node 256
NOTICE: testing basic operations with inner node 256
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree node types with shift "56"
NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "sparse"
NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
test_radixtree
----------------
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index ea993e63df..fe1e168ec4 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -19,6 +19,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -99,6 +100,8 @@ static const test_spec test_specs[] = {
}
};
+static int lwlock_tranche_id;
+
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(test_radixtree);
@@ -112,7 +115,7 @@ test_empty(void)
uint64 key;
uint64 val;
- radixtree = rt_create(CurrentMemoryContext);
+ radixtree = rt_create(CurrentMemoryContext, NULL);
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -140,17 +143,14 @@ test_empty(void)
}
static void
-test_basic(int children, bool test_inner)
+do_test_basic(radix_tree *radixtree, int children, bool test_inner)
{
- radix_tree *radixtree;
uint64 *keys;
int shift = test_inner ? 8 : 0;
elog(NOTICE, "testing basic operations with %s node %d",
test_inner ? "inner" : "leaf", children);
- radixtree = rt_create(CurrentMemoryContext);
-
/* prepare keys in order like 1, 32, 2, 31, 2, ... */
keys = palloc(sizeof(uint64) * children);
for (int i = 0; i < children; i++)
@@ -165,7 +165,7 @@ test_basic(int children, bool test_inner)
for (int i = 0; i < children; i++)
{
if (rt_set(radixtree, keys[i], keys[i]))
- elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found %d", keys[i], i);
}
/* update keys */
@@ -185,7 +185,38 @@ test_basic(int children, bool test_inner)
}
pfree(keys);
- rt_free(radixtree);
+}
+
+static void
+test_basic()
+{
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+ dsa_detach(area);
+
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
@@ -286,14 +317,10 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
* level.
*/
static void
-test_node_types(uint8 shift)
+do_test_node_types(radix_tree *radixtree, uint8 shift)
{
- radix_tree *radixtree;
-
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
- radixtree = rt_create(CurrentMemoryContext);
-
/*
* Insert and search entries for every node type at the 'shift' level,
* then delete all entries to make it empty, and insert and search entries
@@ -302,19 +329,37 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift, true);
test_node_types_delete(radixtree, shift);
test_node_types_insert(radixtree, shift, false);
+}
- rt_free(radixtree);
+static void
+test_node_types(void)
+{
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
* Test with a repeating pattern, defined by the 'spec'.
*/
static void
-test_pattern(const test_spec * spec)
+do_test_pattern(radix_tree *radixtree, const test_spec * spec)
{
- radix_tree *radixtree;
rt_iter *iter;
- MemoryContext radixtree_ctx;
TimestampTz starttime;
TimestampTz endtime;
uint64 n;
@@ -340,18 +385,6 @@ test_pattern(const test_spec * spec)
pattern_values[pattern_num_values++] = i;
}
- /*
- * Allocate the radix tree.
- *
- * Allocate it in a separate memory context, so that we can print its
- * memory usage easily.
- */
- radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
- "radixtree test",
- ALLOCSET_SMALL_SIZES);
- MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
- radixtree = rt_create(radixtree_ctx);
-
/*
* Add values to the set.
*/
@@ -405,8 +438,6 @@ test_pattern(const test_spec * spec)
mem_usage = rt_memory_usage(radixtree);
fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
mem_usage, (double) mem_usage / spec->num_values);
-
- MemoryContextStats(radixtree_ctx);
}
/* Check that rt_num_entries works */
@@ -555,27 +586,57 @@ test_pattern(const test_spec * spec)
if ((nbefore - ndeleted) != nafter)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+}
+
+static void
+test_patterns(void)
+{
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ {
+ radix_tree *tree;
+ MemoryContext radixtree_ctx;
+ dsa_area *area;
+ const test_spec *spec = &test_specs[i];
- MemoryContextDelete(radixtree_ctx);
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+ /* Test the local radix tree */
+ tree = rt_create(radixtree_ctx, NULL);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextReset(radixtree_ctx);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(radixtree_ctx, area);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ dsa_detach(area);
+ MemoryContextDelete(radixtree_ctx);
+ }
}
Datum
test_radixtree(PG_FUNCTION_ARGS)
{
- test_empty();
+ /* get a new lwlock tranche id for all tests for shared radix tree */
+ lwlock_tranche_id = LWLockNewTrancheId();
- for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
- {
- test_basic(rt_node_kind_fanouts[i], false);
- test_basic(rt_node_kind_fanouts[i], true);
- }
-
- for (int shift = 0; shift <= (64 - 8); shift += 8)
- test_node_types(shift);
+ test_empty();
+ test_basic();
- /* Test different test patterns, with lots of entries */
- for (int i = 0; i < lengthof(test_specs); i++)
- test_pattern(&test_specs[i]);
+ test_node_types();
+ test_patterns();
PG_RETURN_VOID();
}
--
2.31.1
Attachment: v12-0004-Use-bitmapword-for-node-125.patch (application/octet-stream)
From 25cc8623d65d333e68cb43a792ba3055bf89b7c9 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 2 Dec 2022 15:27:06 +0900
Subject: [PATCH v12 4/7] Use bitmapword for node-125
---
src/backend/lib/radixtree.c | 71 +++++++++++++++-------------------
src/backend/nodes/bitmapset.c | 38 ------------------
src/include/nodes/bitmapset.h | 22 +----------
src/include/port/pg_bitutils.h | 58 +++++++++++++++++++++++++++
4 files changed, 91 insertions(+), 98 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index e7f61fd943..673cc5e46b 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -207,6 +207,9 @@ typedef struct rt_node_base125
/* The index of slots for each fanout */
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[WORDNUM(128)];
} rt_node_base_125;
typedef struct rt_node_base256
@@ -271,9 +274,6 @@ typedef struct rt_node_leaf_125
{
rt_node_base_125 base;
- /* isset is a bitmap to track which slot is in use */
- uint8 isset[RT_NODE_NSLOTS_BITS(128)];
-
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_125;
@@ -655,13 +655,14 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
}
+#ifdef USE_ASSERT_CHECKING
/* Is the slot in the node used? */
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return (node->children[slot] != NULL);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
static inline bool
@@ -669,8 +670,9 @@ node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
Assert(NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
+#endif
static inline rt_node *
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
@@ -690,7 +692,10 @@ node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
static void
node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
+ int slotpos = node->base.slot_idxs[chunk];
+
Assert(!NODE_IS_LEAF(node));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->children[node->base.slot_idxs[chunk]] = NULL;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -701,44 +706,35 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
int slotpos = node->base.slot_idxs[chunk];
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
/* Return an unused slot in node-125 */
static int
-node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
-{
- int slotpos = 0;
-
- Assert(!NODE_IS_LEAF(node));
- while (node_inner_125_is_slot_used(node, slotpos))
- slotpos++;
-
- return slotpos;
-}
-
-static int
-node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
-{
- int slotpos;
-
- Assert(NODE_IS_LEAF(node));
+node_125_find_unused_slot(bitmapword *isset)
+ {
+ int slotpos;
+ int idx;
+ bitmapword inverse;
- /* We iterate over the isset bitmap per byte then check each bit */
- for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < WORDNUM(128); idx++)
{
- if (node->isset[slotpos] < 0xFF)
- break;
+ if (isset[idx] < ~((bitmapword) 0))
+ break;
}
- Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
- slotpos *= BITS_PER_BYTE;
- while (node_leaf_125_is_slot_used(node, slotpos))
- slotpos++;
+ /* To get the first unset bit in X, get the first set bit in ~X */
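+ /*
+ * Illustration: if isset[0] == 0x7 (slots 0-2 in use), then ~isset[0] has
+ * its rightmost one-bit at position 3, so slot 3 is returned and marked as
+ * used below.
+ */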
+ inverse = ~(isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+
+ /* mark the slot used */
+ isset[idx] |= RIGHTMOST_ONE(inverse);
return slotpos;
-}
+ }
static inline void
node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
@@ -747,8 +743,7 @@ node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
Assert(!NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_inner_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
@@ -763,12 +758,10 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
Assert(NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
node->values[slotpos] = value;
}
@@ -2395,9 +2388,9 @@ rt_dump_node(rt_node *node, int level, bool recurse)
rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < 16; i++)
+ for (int i = 0; i < WORDNUM(128); i++)
{
- fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ fprintf(stderr, UINT64_FORMAT_HEX " ", n->base.isset[i]);
}
fprintf(stderr, "\n");
}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..3fe0fd88ce 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -23,49 +23,11 @@
#include "common/hashfn.h"
#include "nodes/bitmapset.h"
#include "nodes/pg_list.h"
-#include "port/pg_bitutils.h"
-#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
-
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
-
/*
* bms_copy - make a palloc'd copy of a bitmapset
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..06fa21ccaa 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -21,33 +21,13 @@
#define BITMAPSET_H
#include "nodes/nodes.h"
+#include "port/pg_bitutils.h"
/*
* Forward decl to save including pg_list.h
*/
struct List;
-/*
- * Data representation
- *
- * Larger bitmap word sizes generally give better performance, so long as
- * they're not wider than the processor can handle efficiently. We use
- * 64-bit words if pointers are that large, else 32-bit words.
- */
-#if SIZEOF_VOID_P >= 8
-
-#define BITS_PER_BITMAPWORD 64
-typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
-
-#else
-
-#define BITS_PER_BITMAPWORD 32
-typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
-
-#endif
-
typedef struct Bitmapset
{
pg_node_attr(custom_copy_equal, special_read_write)
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 814e0b2dba..ad5aa2c5cf 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,51 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*
+ * Platform-specific types
+ *
+ * Larger bitmap word sizes generally give better performance, so long as
+ * they're not wider than the processor can handle efficiently. We use
+ * 64-bit words if pointers are that large, else 32-bit words.
+ */
+#if SIZEOF_VOID_P >= 8
+
+#define BITS_PER_BITMAPWORD 64
+typedef uint64 bitmapword; /* must be an unsigned type */
+typedef int64 signedbitmapword; /* must be the matching signed type */
+
+#else
+
+#define BITS_PER_BITMAPWORD 32
+typedef uint32 bitmapword; /* must be an unsigned type */
+typedef int32 signedbitmapword; /* must be the matching signed type */
+
+#endif
+
+#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
+
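+/*
+ * Illustration: with 64-bit bitmapwords, bit number 70 lives in word
+ * WORDNUM(70) == 1, at bit position BITNUM(70) == 6 within that word.
+ */
+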
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
+
+#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
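+
+/*
+ * Illustration: for x = 0b0110100, -x is ...1001100 in two's complement, so
+ * RIGHTMOST_ONE(x) == 0b0000100, keeping only the rightmost one-bit.
+ * Accordingly, HAS_MULTIPLE_ONES(0b0110100) is true and
+ * HAS_MULTIPLE_ONES(0b0000100) is false.
+ */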
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
@@ -291,4 +336,17 @@ pg_rotate_left32(uint32 word, int n)
#define pg_prevpower2_size_t pg_prevpower2_64
#endif
+/* variants of some functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_leftmost_one_pos pg_leftmost_one_pos32
+#define bmw_rightmost_one_pos pg_rightmost_one_pos32
+#define bmw_popcount pg_popcount32
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_leftmost_one_pos pg_leftmost_one_pos64
+#define bmw_rightmost_one_pos pg_rightmost_one_pos64
+#define bmw_popcount pg_popcount64
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
+
#endif /* PG_BITUTILS_H */
--
2.31.1
Attachment: v12-0002-Add-radix-implementation.patch (application/octet-stream)
From 68401b497992d33ef5758b5ddb75244d550240d5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v12 2/7] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2541 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 581 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3291 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..e7f61fd943
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2541 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes: a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes (shift > 0)
+ * store pointers to their child nodes as values, whereas leaf nodes
+ * (shift == 0) store the 64-bit unsigned integer specified by the user as the
+ * value. The paper refers to this technique as "Multi-value leaves". We choose
+ * it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants, one for inner nodes and
+ * one for leaf nodes, so there is some code duplication. While this sometimes
+ * makes code maintenance tricky, it reduces branch prediction misses when
+ * judging whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for each kind of radix tree node.
+ *
+ * rt_iterate_next() returns key-value pairs in ascending order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes required for a bitmap covering nslots slots,
+ * used by nodes that track slot usage with an is-set bitmap.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
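+
+/*
+ * Illustration: with RT_NODE_SPAN == 8, the key 0x0807060504030201 decomposes
+ * into the chunks 0x08, 0x07, ..., 0x01 from the topmost level down; for
+ * example, RT_GET_KEY_CHUNK(0x0807060504030201, 24) == 0x04.
+ */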
+
+/*
+ * Mapping from a slot (or chunk) number to the byte and bit in the is-set
+ * bitmap, used by node-125 and node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by the node search functions (rt_node_search_inner/leaf) */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind share the same
+ * node structure but have a different fanout, which is stored in 'fanout' of
+ * rt_node. For example in size class 15, when a 16th element is to be
+ * inserted, we allocate a larger area and memcpy the entire old node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
+/* Common type for all nodes types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < rt_size_class_info[class].fanout)
+
+/* Base types of each node kind, shared by leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+class for variable-sized node kinds */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-125 uses a slot_idxs array of RT_NODE_MAX_SLOTS (256) entries to store
+ * indexes into a second array that contains up to 125 values (or child
+ * pointers in inner nodes).
+ */
+typedef struct rt_node_base125
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_125;
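+
+/*
+ * Illustration: if chunk 0x20 is stored in slot 3 (slot_idxs[0x20] == 3), its
+ * value lives in values[3] of a leaf node (or children[3] of an inner node);
+ * unused chunks keep RT_NODE_125_INVALID_IDX.
+ */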
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_125
+{
+ rt_node_base_125 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_125;
+
+typedef struct rt_node_leaf_125
+{
+ rt_node_base_125 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_125;
+
+/*
+ * node-256 is the largest node type. This node has an array of RT_NODE_MAX_SLOTS entries
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size class */
+typedef struct rt_size_class_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_size_class_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
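+
+/*
+ * Illustration: if the iterator is at chunk 0x12 at shift 8 and chunk 0x34 at
+ * shift 0, the key under construction is 0x1234; each level contributes its
+ * 8-bit chunk at its shift.
+ */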
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64 *) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(rt_node *) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+static void
+node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+/* Return an unused slot in node-125 */
+static int
+node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
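+
+/*
+ * Illustration: for key 0x1234 the leftmost set bit is bit 12, so
+ * key_get_shift() returns 8, and shift_get_max_val(8) == 0xFFFF, i.e. a root
+ * node at shift 8 can address keys up to 0xFFFF.
+ */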
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ bool inner = shift > 0;
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = newnode;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ else
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+ * Technically it's 256, but we cannot store that in a uint8,
+ * and since this is the max size class it will never grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+static inline void
+rt_copy_node(rt_node *newnode, rt_node *oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node*
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+{
+ rt_node *newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
+ rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
+ rt_copy_node(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
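+ *
+ * Illustration: if the current root is at shift 8 (covering keys up to
+ * 0xFFFF) and key 0x1000000 arrives, two new node-4 inner nodes are pushed on
+ * top at shifts 16 and 24, and max_val becomes 0xFFFFFFFF.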
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->base.n.shift = shift;
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' to bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_125_get_child(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * to the value is set to value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_125_get_value(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_inner_32 *new32;
+
+ new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_inner_32;
+ }
+ else
+ {
+ rt_node_inner_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+retry_insert_inner_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_125_update(n125, chunk, child);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_inner_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_125_insert(n125, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and child are inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_leaf_32 *new32;
+
+ new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
+ rt_node_leaf_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
+ key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+ retry_insert_leaf_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_125_update(n125, chunk, value);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_leaf_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_125_insert(n125, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true; if it does not yet exist, insert a new entry and return false.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set to *value_p, which
+ * therefore must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from level 1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if it exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_125_get_child(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to value_p, otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_125_get_value(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ n125->slot_idxs[i]));
+ else
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ n125->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check that the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_125_get_value(n125, i));
+ }
+ else
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_125_get_child(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 96addded81..11d0ec5b07 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -27,6 +27,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1d26544854..568823b221 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -21,6 +21,7 @@ subdir('test_oat_hooks')
subdir('test_parser')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..ea993e63df
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,581 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test the radix tree data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /* prepare keys in interleaved order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test inserting and deleting key-value pairs for each node type at the given
+ * shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
Attachment: v12-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From 4468f93f23b2900392b1510b8e572ca6e14a9dbd Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v12 1/7] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
On Fri, Dec 2, 2022 at 11:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
> > - Optimize node128 insert.
> >
> > I've attached a rough start at this. The basic idea is borrowed from
> > our bitmapset nodes, so we can iterate over and operate on word-sized
> > (32- or 64-bit) types at a time, rather than bytes.
>
> Thanks! I think this is a good idea.
>
> > To make this easier, I've moved some of the lower-level macros and
> > types from bitmapset.h/.c to pg_bitutils.h. That's probably going to
> > need a separate email thread to resolve the coding style clash this
> > causes, so that can be put off for later.
I started a separate thread [1], and 0002 comes from feedback on that.
There is a FIXME about using WORDNUM and BITNUM, at least with that
spelling. I'm putting that off to ease rebasing the rest as v13 -- getting
some CI testing with 0002 seems like a good idea. There are no other
changes yet. Next, I will take a look at templating local vs. shared
memory. I might try basing that on the styles of both v12 and v8, and see
which one works best with templating.
[1]: /messages/by-id/CAFBsxsFW2JjTo58jtDB+3sZhxMx3t-3evew8=Acr+GGhC+kFaA@mail.gmail.com
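To make the word-at-a-time idea concrete, here is a minimal sketch of how a
free slot in a node-125 "isset" bitmap can be located one bitmapword at a time
rather than byte by byte. This is only an illustration, not code from the
attached patches: node_125_find_free_slot is a made-up name, and it assumes the
WORDNUM/BITNUM macros and bmw_rightmost_one_pos() that appear in the patches
below, with the caller having already checked that the node has a free slot.

static int
node_125_find_free_slot(bitmapword *isset, int nwords)
{
    for (int i = 0; i < nwords; i++)
    {
        bitmapword  inverse = ~isset[i];

        /* any zero bit in this word is a free slot */
        if (inverse != 0)
        {
            int     slot = i * BITS_PER_BITMAPWORD + bmw_rightmost_one_pos(inverse);

            /* mark the slot as used and hand it back */
            isset[WORDNUM(slot)] |= ((bitmapword) 1) << BITNUM(slot);
            return slot;
        }
    }

    /* should not happen if the caller checked for a free slot */
    return -1;
}

With 128 isset bits and 64-bit bitmapwords, this inspects at most two words
instead of scanning up to 128 slots individually.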
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
Attachment: v13-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 1dc766a6a33ba379c27c15677b7ec2c02384ba8e Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v13 2/8] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..4384ff591d 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..fdc504596b 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 814e0b2dba..f95b6afd86 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 58daeca831..68df6ddc0b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3651,7 +3651,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.38.1
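
(Not part of the attached patches; the helper name below is made up purely for illustration.) For reviewers who want to see the bit trick in isolation: the patch above moves the "x & -x" idea into pg_rightmost_one32/64, and HAS_MULTIPLE_ONES() then just compares that result with the original word. A minimal standalone sketch:

#include <stdint.h>
#include <stdio.h>

/*
 * Same idea as the patched pg_rightmost_one64(): AND the word with its
 * two's-complement negative to keep only the lowest set bit. Writing it
 * as "~word + 1" avoids the signed casts but computes the same value.
 */
static inline uint64_t
rightmost_one64(uint64_t word)
{
	return word & (~word + 1);
}

int
main(void)
{
	uint64_t	w = 0x58;		/* binary 1011000 */

	/* prints 0x8: only the lowest set bit survives */
	printf("0x%llx\n", (unsigned long long) rightmost_one64(w));

	/* the HAS_MULTIPLE_ONES() test: prints 1, since 0x8 != 0x58 */
	printf("%d\n", rightmost_one64(w) != w);

	return 0;
}
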
Attachment: v13-0004-Use-bitmapword-for-node-125.patch (text/x-patch; charset=US-ASCII)
From bacc9b9ced17faeb868a5e5684c5016ffcc68ff6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 15:22:26 +0700
Subject: [PATCH v13 4/8] Use bitmapword for node-125
TODO: Rename macros copied from bitmapset.c
---
src/backend/lib/radixtree.c | 70 ++++++++++++++++++-------------------
1 file changed, 34 insertions(+), 36 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index e7f61fd943..abd0450727 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -62,6 +62,7 @@
#include "lib/radixtree.h"
#include "lib/stringinfo.h"
#include "miscadmin.h"
+#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
#include "utils/memutils.h"
@@ -103,6 +104,10 @@
#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+/* FIXME rename */
+#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
+
/* Enum used rt_node_search() */
typedef enum
{
@@ -207,6 +212,9 @@ typedef struct rt_node_base125
/* The index of slots for each fanout */
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[WORDNUM(128)];
} rt_node_base_125;
typedef struct rt_node_base256
@@ -271,9 +279,6 @@ typedef struct rt_node_leaf_125
{
rt_node_base_125 base;
- /* isset is a bitmap to track which slot is in use */
- uint8 isset[RT_NODE_NSLOTS_BITS(128)];
-
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_125;
@@ -655,13 +660,14 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
}
+#ifdef USE_ASSERT_CHECKING
/* Is the slot in the node used? */
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return (node->children[slot] != NULL);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
static inline bool
@@ -669,8 +675,9 @@ node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
Assert(NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
+#endif
static inline rt_node *
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
@@ -690,7 +697,10 @@ node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
static void
node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
+ int slotpos = node->base.slot_idxs[chunk];
+
Assert(!NODE_IS_LEAF(node));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->children[node->base.slot_idxs[chunk]] = NULL;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -701,44 +711,35 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
int slotpos = node->base.slot_idxs[chunk];
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
/* Return an unused slot in node-125 */
static int
-node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
-{
- int slotpos = 0;
-
- Assert(!NODE_IS_LEAF(node));
- while (node_inner_125_is_slot_used(node, slotpos))
- slotpos++;
-
- return slotpos;
-}
-
-static int
-node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+node_125_find_unused_slot(bitmapword *isset)
{
int slotpos;
+ int idx;
+ bitmapword inverse;
- Assert(NODE_IS_LEAF(node));
-
- /* We iterate over the isset bitmap per byte then check each bit */
- for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < WORDNUM(128); idx++)
{
- if (node->isset[slotpos] < 0xFF)
+ if (isset[idx] < ~((bitmapword) 0))
break;
}
- Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
- slotpos *= BITS_PER_BYTE;
- while (node_leaf_125_is_slot_used(node, slotpos))
- slotpos++;
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+
+ /* mark the slot used */
+ isset[idx] |= bmw_rightmost_one(inverse);
return slotpos;
-}
+ }
static inline void
node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
@@ -747,8 +748,7 @@ node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
Assert(!NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_inner_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
@@ -763,12 +763,10 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
Assert(NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
node->values[slotpos] = value;
}
@@ -2395,9 +2393,9 @@ rt_dump_node(rt_node *node, int level, bool recurse)
rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < 16; i++)
+ for (int i = 0; i < WORDNUM(128); i++)
{
- fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ fprintf(stderr, UINT64_FORMAT_HEX " ", n->base.isset[i]);
}
fprintf(stderr, "\n");
}
--
2.38.1
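
(Again not part of the patch; names and the 128-slot layout below are assumed just for the example.) The rewritten node_125_find_unused_slot() above does a word-at-a-time search: find the first isset word that is not all ones, take the rightmost set bit of the complemented word as the first free slot, then mark it used with the same "x & -x" trick. A standalone approximation:

#include <stdint.h>
#include <stdio.h>

#define NWORDS 2				/* 128 slots / 64 bits per word */

/* Position of the lowest set bit; word must be nonzero. */
static int
rightmost_one_pos64(uint64_t word)
{
	int			pos = 0;

	while ((word & 1) == 0)
	{
		word >>= 1;
		pos++;
	}
	return pos;
}

/*
 * Return the first free slot and mark it used. Like the patched code,
 * this assumes at least one free slot exists.
 */
static int
find_unused_slot(uint64_t *isset)
{
	int			idx;
	uint64_t	inverse;
	int			slotpos;

	/* get the first word with at least one bit not set */
	for (idx = 0; idx < NWORDS; idx++)
	{
		if (isset[idx] != ~UINT64_C(0))
			break;
	}

	/* the first unset bit in X is the first set bit in ~X */
	inverse = ~isset[idx];
	slotpos = idx * 64 + rightmost_one_pos64(inverse);

	/* mark the slot used */
	isset[idx] |= inverse & (~inverse + 1);

	return slotpos;
}

int
main(void)
{
	uint64_t	isset[NWORDS] = {~UINT64_C(0), UINT64_C(0x7)};	/* slots 0..66 used */

	printf("%d\n", find_unused_slot(isset));	/* 67 */
	printf("%d\n", find_unused_slot(isset));	/* 68 */
	return 0;
}
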
Attachment: v13-0003-Add-radix-implementation.patch (text/x-patch; charset=US-ASCII)
From 377cc13755e9129e672e72deaccc2f8d36fe8fa5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v13 3/8] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2541 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 581 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3291 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..e7f61fd943
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2541 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only a fixed key length, so we don't expect the tree to
+ * become very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes (shift > 0)
+ * store the pointer to the child node as the value, while leaf nodes
+ * (shift == 0) store the 64-bit unsigned integer that is specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is some duplicated code. While this sometimes makes
+ * code maintenance tricky, it reduces branch prediction misses when judging
+ * whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * and memory contexts for all kinds of radix tree node under the memory context.
+ *
+ * rt_iterate_next() ensures returning key-value pairs in the ascending
+ * order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes required for a bitmap of nslots slots, used
+ * by nodes whose slots are indexed by array lookup.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from the value to the bit in is-set bitmap in the node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind share the same
+ * node structure but have a different fanout, which is stored
+ * in 'fanout' of rt_node. For example in size class 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
+/* Common type for all nodes types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < rt_size_class_info[class].fanout)
+
+/* Base types of each node kind for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+   class for variable-sized node kinds */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-125 uses a slot_idxs array of RT_NODE_MAX_SLOTS (typically 256) entries
+ * to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base125
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_125;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different from something fitting into a
+ * pointer-width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_125
+{
+ rt_node_base_125 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_125;
+
+typedef struct rt_node_leaf_125
+{
+ rt_node_base_125 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_125;
+
+/*
+ * node-256 is the largest node type. This node has an array of length RT_NODE_MAX_SLOTS
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size class */
+typedef struct rt_size_class_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_size_class_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating over the radix tree returns each pair of key and value in
+ * ascending key order. To support this, we iterate over nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return index of the first element in 'base' that equals 'key'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in 'base' that equals 'key'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64 *) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(rt_node *) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+static void
+node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+/* Return an unused slot in node-125 */
+static int
+node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ bool inner = shift > 0;
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = newnode;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ else
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+ * Technically it's 256, but we cannot store that in a uint8,
+ * and this is the max size class so it will never grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+static inline void
+rt_copy_node(rt_node *newnode, rt_node *oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node*
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+{
+ rt_node *newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
+ rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
+ rt_copy_node(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->base.n.shift = shift;
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_125_get_child(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * to the value is set to value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_125_get_value(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_inner_32 *new32;
+
+ new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_inner_32;
+ }
+ else
+ {
+ rt_node_inner_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+retry_insert_inner_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_125_update(n125, chunk, child);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_inner_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_125_insert(n125, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_leaf_32 *new32;
+
+ new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
+ rt_node_leaf_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
+ key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+ retry_insert_leaf_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_125_update(n125, chunk, value);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_leaf_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_125_insert(n125, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to 'value'
+ * and return true. Returns false if entry doesn't yet exist.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, we set the value to *value_p, so it must
+ * not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+		/* the key was not found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+	/* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is
+	 * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update the node_iter for each node in the iterator's node stack, descending from 'from_node'.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * If there is a next key, set *key_p and *value_p and return true. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner node
+		 * iterators, starting from level 1, until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+		 * Found the next child node, so update the iterator stack from this
+		 * node down to the leaf level.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance to the next slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_125_get_child(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance to the next slot in the leaf node. On success, return true and set
+ * the value to *value_p; otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_125_get_value(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ n125->slot_idxs[i]));
+ else
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ n125->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+				/* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_125_get_value(n125, i));
+ }
+ else
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_125_get_child(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
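For reviewers who want to see the interface in isolation, here is a minimal usage sketch of the API declared above. It is not part of the patch: the function name radix_tree_example is made up, and it assumes a caller running in an ordinary backend memory context, much like the test module further down.

#include "postgres.h"
#include "lib/radixtree.h"

static void
radix_tree_example(void)
{
	radix_tree *tree = rt_create(CurrentMemoryContext);
	rt_iter    *iter;
	uint64		key;
	uint64		value;

	/* insert or update; rt_set() returns true iff the key already existed */
	(void) rt_set(tree, 10, 123);

	/* point lookup */
	if (rt_search(tree, 10, &value))
		elog(NOTICE, "found value " UINT64_FORMAT, value);

	/* iterate over all key-value pairs, in ascending key order */
	iter = rt_begin_iterate(tree);
	while (rt_iterate_next(iter, &key, &value))
		elog(NOTICE, "key " UINT64_FORMAT " -> " UINT64_FORMAT, key, value);
	rt_end_iterate(iter);

	(void) rt_delete(tree, 10);
	rt_free(tree);
}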
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 96addded81..11d0ec5b07 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -27,6 +27,7 @@ SUBDIRS = \
test_parser \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1d26544854..568823b221 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -21,6 +21,7 @@ subdir('test_oat_hooks')
subdir('test_parser')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..ea993e63df
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,581 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the tests, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
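+
+/*
+ * (The leading 0 entry is a sentinel so that rt_node_kind_fanouts[idx - 1]
+ * gives the fanout of the previous node kind when the array is walked in
+ * test_node_types_insert().)
+ */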
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
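+
+/*
+ * As an example of how the specs below are expanded (matching the logic in
+ * test_pattern()): with pattern_str = "0101010101" and spacing = 10, the '1'
+ * characters sit at offsets 1, 3, 5, 7 and 9, so the keys set are
+ * 1, 3, 5, 7, 9, then 11, 13, 15, 17, 19, and so on, until num_values keys
+ * have been inserted.
+ */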
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+	/* prepare keys in an order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check that the keys from 'start' to 'end' (shifted by 'shift') exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+			 num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test inserting and deleting key-value pairs for each node type at the
+ * given shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.38.1
v13-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From 3b3d8b87123413bfc04ece39bdfbfdd784b3a02c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v13 1/8] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return a vector of the per-element minimums.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.38.1
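To illustrate how vector8_highbit_mask() and the existing simd.h helpers are meant to be combined, here is a hypothetical sketch, not taken from any of the attached patches: the helper name chunk_array_search_eq is made up, the chunk array is assumed to be padded to a multiple of sizeof(Vector8) (as the fixed-size node layouts are), and it relies on vector8_load(), vector8_broadcast() and vector8_eq() from simd.h plus pg_rightmost_one_pos32() from pg_bitutils.h.

#include "postgres.h"
#include "port/pg_bitutils.h"
#include "port/simd.h"

#ifndef USE_NO_SIMD
/*
 * Return the index of the first of the 'count' valid bytes in 'chunks' that
 * equals 'chunk', or -1 if there is none. 'chunks' must be padded to a
 * multiple of sizeof(Vector8) so that whole vectors can be loaded safely.
 */
static inline int
chunk_array_search_eq(const uint8 *chunks, uint8 chunk, int count)
{
	Vector8		spread_chunk = vector8_broadcast(chunk);

	for (int i = 0; i < count; i += sizeof(Vector8))
	{
		Vector8		haystack;
		uint32		bitfield;

		vector8_load(&haystack, &chunks[i]);
		bitfield = vector8_highbit_mask(vector8_eq(haystack, spread_chunk));

		if (bitfield)
		{
			int			index = i + pg_rightmost_one_pos32(bitfield);

			/* a match beyond 'count' is just uninitialized padding */
			return (index < count) ? index : -1;
		}
	}

	return -1;
}
#endif							/* ! USE_NO_SIMD */

With USE_NO_SIMD a node search would fall back to a plain byte-by-byte loop, since vector8_eq() is only defined for the SIMD builds.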
v13-0005-tool-for-measuring-radix-tree-performance.patch
From 3c4009682a186e3826803db8fb859cde527c6e76 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v13 5/8] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 +++
contrib/bench_radix_tree/bench_radix_tree.c | 635 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 767 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..83529805fc
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..a0693695e6
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,635 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
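+
+/*
+ * Worked example of the encoding above (assuming 8kB pages, so
+ * MaxHeapTuplesPerPage = 291 and shift = 9): the TID (block 10, offset 3)
+ * becomes tid_i = (10 << 9) | 3 = 5123, which splits into
+ * key = 5123 >> 6 = 80 and *off = 5123 & 63 = 3, i.e. bit 3 of the 64-bit
+ * value stored under key 80.
+ */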
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+	/* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+	/* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+		/* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+	/* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.38.1
Attachment: v13-0008-PoC-lazy-vacuum-integration.patch (text/x-patch)
From 5f20ef14890f10cfd4290fa212440ea8a10dd318 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 4 Nov 2022 14:14:42 +0900
Subject: [PATCH v13 8/8] PoC: lazy vacuum integration.
The patch includes:
* Introducing a new module called TIDStore
* Lazy vacuum and parallel vacuum integration.
TODOs:
* radix tree needs to have the reset functionality.
* should not allow TIDStore to grow beyond the memory limit.
* change the progress statistics of pg_stat_progress_vacuum.
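
For illustration, the TIDStore API introduced here is expected to be used
roughly as follows (a sketch only; blkno, deadoffsets, lpdead_items, itemptr
and iter stand for the caller's variables, and error handling and memory
accounting are omitted):

    TIDStore   *dead_items = tidstore_create(NULL);   /* NULL = backend-local */

    /* first heap pass: remember the LP_DEAD offsets found on one block */
    tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);

    /* index vacuum: is this index tuple's heap TID dead? */
    if (tidstore_lookup_tid(dead_items, &itemptr))
        /* delete the index tuple */ ;

    /* second heap pass: visit dead TIDs block by block */
    iter = tidstore_begin_iterate(dead_items);
    while (tidstore_iterate_next(iter))
        /* set iter->offsets[0 .. iter->num_offsets - 1] on iter->blkno to LP_UNUSED */ ;

    tidstore_free(dead_items);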
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 448 ++++++++++++++++++++++++++
src/backend/access/heap/vacuumlazy.c | 164 +++-------
src/backend/commands/vacuum.c | 76 +----
src/backend/commands/vacuumparallel.c | 63 ++--
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 60 ++++
src/include/commands/vacuum.h | 24 +-
src/include/storage/lwlock.h | 1 +
10 files changed, 612 insertions(+), 228 deletions(-)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index 857beaa32d..76265974b1 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..c3cf771f7d
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,448 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * TID (ItemPointer) storage implementation.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+#include "miscadmin.h"
+
+#define XXX_DEBUG_TID_STORE 1
+
+/* XXX: should be configurable for non-heap AMs */
+#define TIDSTORE_OFFSET_NBITS 11 /* pg_ceil_log2_32(MaxHeapTuplesPerPage) */
+
+#define TIDSTORE_VALUE_NBITS 6 /* log(sizeof(uint64) * BITS_PER_BYTE, 2) */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
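+/*
+ * For illustration: with the 11-bit offset encoding above, the TID
+ * (blkno = 10, off = 3) is first encoded as tid_i = (10 << 11) | 3 = 20483.
+ * The low 6 bits give the bit position in the stored 64-bit value
+ * (20483 & 63 = 3) and the rest becomes the key (20483 >> 6 = 320);
+ * KEY_GET_BLKNO(320) = 320 >> 5 = 10 recovers the block number. Each heap
+ * block therefore maps to at most 2^(11 - 6) = 32 keys, each holding a
+ * 64-bit bitmap of offsets.
+ */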
+
+struct TIDStore
+{
+ /* main storage for TID */
+ radix_tree *tree;
+
+ /* # of tids in TIDStore */
+ int num_tids;
+
+ /* DSA area and handle for shared TIDStore */
+ rt_handle handle;
+ dsa_area *area;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ ItemPointer itemptrs;
+ uint64 nitems;
+#endif
+};
+
+static void tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+/*
+ * Comparator routines for use with qsort() and bsearch().
+ */
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+
+static void
+verify_iter_tids(TIDStoreIter *iter)
+{
+ uint64 index = iter->prev_index;
+
+ if (iter->ts->itemptrs == NULL)
+ return;
+
+ Assert(index <= iter->ts->nitems);
+
+ for (int i = 0; i < iter->num_offsets; i++)
+ {
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, iter->blkno);
+ ItemPointerSetOffsetNumber(&tid, iter->offsets[i]);
+
+ Assert(ItemPointerEquals(&iter->ts->itemptrs[index++], &tid));
+ }
+
+ iter->prev_index = iter->itemptrs_index;
+}
+
+static void
+dump_itemptrs(TIDStore *ts)
+{
+ StringInfoData buf;
+
+ if (ts->itemptrs == NULL)
+ return;
+
+ initStringInfo(&buf);
+ for (int i = 0; i < ts->nitems; i++)
+ {
+ appendStringInfo(&buf, "(%d,%d) ",
+ ItemPointerGetBlockNumber(&(ts->itemptrs[i])),
+ ItemPointerGetOffsetNumber(&(ts->itemptrs[i])));
+ }
+ elog(WARNING, "--- dump (" UINT64_FORMAT " items) ---", ts->nitems);
+ elog(WARNING, "%s\n", buf.data);
+}
+
+#endif
+
+/*
+ * Create a TIDStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TIDStore *
+tidstore_create(dsa_area *area)
+{
+ TIDStore *ts;
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+ ts->area = area;
+
+ if (area != NULL)
+ ts->handle = rt_get_handle(ts->tree);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+#define MAXDEADITEMS(avail_mem) \
+ (avail_mem / sizeof(ItemPointerData))
+
+ if (area == NULL)
+ {
+ ts->itemptrs = (ItemPointer) palloc0(sizeof(ItemPointerData) *
+										 MAXDEADITEMS(maintenance_work_mem * 1024L));
+ ts->nitems = 0;
+ }
+#endif
+
+ return ts;
+}
+
+/* Attach to the shared TIDStore using a handle */
+TIDStore *
+tidstore_attach(dsa_area *area, rt_handle handle)
+{
+ TIDStore *ts;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ ts = palloc0(sizeof(TIDStore));
+ ts->tree = rt_attach(area, handle);
+
+ return ts;
+}
+
+/*
+ * Detach from a TIDStore. This detaches from the radix tree and frees the
+ * backend-local resources.
+ */
+void
+tidstore_detach(TIDStore *ts)
+{
+ rt_detach(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_free(TIDStore *ts)
+{
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ pfree(ts->itemptrs);
+#endif
+
+ rt_free(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_reset(TIDStore *ts)
+{
+ dsa_area *area = ts->area;
+
+ /* Reset the statistics */
+ ts->num_tids = 0;
+
+ /* Recreate radix tree storage */
+ rt_free(ts->tree);
+ ts->tree = rt_create(CurrentMemoryContext, area);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ ts->nitems = 0;
+#endif
+}
+
+/* Add TIDs to TIDStore */
+void
+tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ ts->num_tids++;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ {
+ ItemPointerSetBlockNumber(&(ts->itemptrs[ts->nitems]), blkno);
+ ItemPointerSetOffsetNumber(&(ts->itemptrs[ts->nitems]), offsets[i]);
+ ts->nitems++;
+ }
+#endif
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(ts->nitems == ts->num_tids);
+#endif
+}
+
+/* Return true if the given TID is present in TIDStore */
+bool
+tidstore_lookup_tid(TIDStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ bool found_assert;
+#endif
+
+ key = tid_to_key_off(tid, &off);
+
+ found = rt_search(ts->tree, key, &val);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ found_assert = bsearch((void *) tid,
+ (void *) ts->itemptrs,
+ ts->nitems,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr) != NULL;
+#endif
+
+ if (!found)
+ {
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(!found_assert);
+#endif
+ return false;
+ }
+
+ found = (val & (UINT64CONST(1) << off)) != 0;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+
+ if (ts->itemptrs && found != found_assert)
+ {
+ elog(WARNING, "tid (%d,%d)\n",
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
+ dump_itemptrs(ts);
+ }
+
+ if (ts->itemptrs)
+ Assert(found == found_assert);
+
+#endif
+ return found;
+}
+
+TIDStoreIter *
+tidstore_begin_iterate(TIDStore *ts)
+{
+ TIDStoreIter *iter;
+
+ iter = palloc0(sizeof(TIDStoreIter));
+ iter->ts = ts;
+ iter->tree_iter = rt_begin_iterate(ts->tree);
+ iter->blkno = InvalidBlockNumber;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index = 0;
+#endif
+
+ return iter;
+}
+
+bool
+tidstore_iterate_next(TIDStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+
+ if (iter->finished)
+ return false;
+
+ if (BlockNumberIsValid(iter->blkno))
+ {
+ iter->num_offsets = 0;
+ tidstore_iter_collect_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (rt_iterate_next(iter->tree_iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(iter->blkno) && iter->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+ return true;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_collect_tids(iter, key, val);
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+
+ iter->finished = true;
+ return true;
+}
+
+uint64
+tidstore_num_tids(TIDStore *ts)
+{
+ return ts->num_tids;
+}
+
+uint64
+tidstore_memory_usage(TIDStore *ts)
+{
+ return (uint64) sizeof(TIDStore) + rt_memory_usage(ts->tree);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TIDStore
+ */
+tidstore_handle
+tidstore_get_handle(TIDStore *ts)
+{
+ return rt_get_handle(ts->tree);
+}
+
+/* Extract TIDs from key-value pair */
+static void
+tidstore_iter_collect_tids(TIDStoreIter *iter, uint64 key, uint64 val)
+{
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ iter->offsets[iter->num_offsets++] = off;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index++;
+#endif
+ }
+
+ iter->blkno = KEY_GET_BLKNO(key);
+}
+
+/* Encode a TID to key and val */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d59711b7ec..75dead6c14 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -144,6 +145,8 @@ typedef struct LVRelState
Relation *indrels;
int nindexes;
+	int64		max_bytes;
+
/* Aggressive VACUUM? (must set relfrozenxid >= FreezeLimit) */
bool aggressive;
/* Use visibility map to skip? (disabled by DISABLE_PAGE_SKIPPING) */
@@ -194,7 +197,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TIDStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -265,8 +268,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -392,6 +396,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
vacrel->indname = NULL;
vacrel->phase = VACUUM_ERRCB_PHASE_UNKNOWN;
vacrel->verbose = verbose;
+ vacrel->max_bytes = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
errcallback.callback = vacuum_error_callback;
errcallback.arg = vacrel;
errcallback.previous = error_context_stack;
@@ -853,7 +860,7 @@ lazy_scan_heap(LVRelState *vacrel)
next_unskippable_block,
next_failsafe_block = 0,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TIDStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
@@ -867,7 +874,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = vacrel->max_bytes; /* XXX: should use # of tids */
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -937,8 +944,8 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ /* XXX: should not allow tidstore to grow beyond max_bytes */
+ if (tidstore_memory_usage(vacrel->dead_items) > vacrel->max_bytes)
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1070,11 +1077,17 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TIDStoreIter *iter;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ pfree(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1111,7 +1124,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1264,7 +1277,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1863,25 +1876,16 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
Assert(!prunestate->all_visible);
Assert(prunestate->has_lpdead_items);
vacrel->lpdead_item_pages++;
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
}
/* Finally, add page-local counts to whole-VACUUM counts */
@@ -2088,8 +2092,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2098,17 +2101,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- Assert(dead_items->num_items <= dead_items->max_items);
pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ tidstore_num_tids(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2157,7 +2153,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2186,7 +2182,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2213,8 +2209,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2259,7 +2255,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ /* tidstore_reset(vacrel->dead_items); */
}
/*
@@ -2331,7 +2327,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2368,10 +2364,10 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TIDStoreIter *iter;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2388,8 +2384,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while (tidstore_iterate_next(iter))
{
BlockNumber tblk;
Buffer buf;
@@ -2398,12 +2394,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = iter->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, iter->offsets, iter->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2427,14 +2424,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+			(errmsg("table \"%s\": removed " UINT64_FORMAT " dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2451,11 +2447,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2474,16 +2469,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2563,7 +2553,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3065,46 +3054,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3115,12 +3064,6 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
-
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
* be used for an index, so we invoke parallelism only if there are at
@@ -3146,7 +3089,6 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3159,11 +3101,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index a6d5ed1f6b..62db8b0101 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2283,16 +2282,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TIDStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2323,18 +2322,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2345,60 +2332,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TIDStore *dead_items = (TIDStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26d796e52..742039b3a6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TIDStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,22 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +289,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +356,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +375,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +384,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +441,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_free(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +452,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TIDStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +950,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +996,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1045,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a5ad36ca78..2fb30fe2e7 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -183,6 +183,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..f4ccf1dbc5
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,60 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * TID storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TIDStore TIDStore;
+
+typedef struct TIDStoreIter
+{
+ TIDStore *ts;
+
+ rt_iter *tree_iter;
+
+ bool finished;
+
+ uint64 next_key;
+ uint64 next_val;
+
+ BlockNumber blkno;
+	OffsetNumber offsets[MaxOffsetNumber];	/* XXX: usually only partially used */
+ int num_offsets;
+
+#ifdef USE_ASSERT_CHECKING
+ uint64 itemptrs_index;
+ int prev_index;
+#endif
+} TIDStoreIter;
+
+extern TIDStore *tidstore_create(dsa_area *dsa);
+extern TIDStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TIDStore *ts);
+extern void tidstore_free(TIDStore *ts);
+extern void tidstore_reset(TIDStore *ts);
+extern void tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TIDStore *ts, ItemPointer tid);
+extern TIDStoreIter * tidstore_begin_iterate(TIDStore *ts);
+extern bool tidstore_iterate_next(TIDStoreIter *iter);
+extern uint64 tidstore_num_tids(TIDStore *ts);
+extern uint64 tidstore_memory_usage(TIDStore *ts);
+extern tidstore_handle tidstore_get_handle(TIDStore *ts);
+
+#endif /* TIDSTORE_H */
+
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4e4bc26a8b..c15e6d7a66 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -235,21 +236,6 @@ typedef struct VacuumParams
int nworkers;
} VacuumParams;
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -302,18 +288,16 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TIDStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TIDStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index a494cb598f..88e35254d1 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -201,6 +201,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
--
2.38.1
Attachment: v13-0007-PoC-DSA-support-for-radix-tree.patch (text/x-patch)
From f413f05673b9f85a62ef16f2b0c51614362f62ec Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 16:42:55 +0700
Subject: [PATCH v13 7/8] PoC: DSA support for radix tree
---
.../bench_radix_tree--1.0.sql | 2 +
contrib/bench_radix_tree/bench_radix_tree.c | 16 +-
src/backend/lib/radixtree.c | 437 ++++++++++++++----
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 8 +-
src/include/utils/dsa.h | 1 +
.../expected/test_radixtree.out | 25 +
.../modules/test_radixtree/test_radixtree.c | 147 ++++--
8 files changed, 502 insertions(+), 146 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 83529805fc..d9216d715c 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -7,6 +7,7 @@ create function bench_shuffle_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
@@ -23,6 +24,7 @@ create function bench_seq_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index a0693695e6..1a26722495 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -154,6 +154,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
BlockNumber maxblk = PG_GETARG_INT32(1);
bool random_block = PG_GETARG_BOOL(2);
radix_tree *rt = NULL;
+ bool shared = PG_GETARG_BOOL(3);
+ dsa_area *dsa = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;
@@ -176,7 +178,11 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
/* measure the load time of the radix tree */
- rt = rt_create(CurrentMemoryContext);
+ if (shared)
+ dsa = dsa_create(LWLockNewTrancheId());
+ rt = rt_create(CurrentMemoryContext, dsa);
+
+ /* measure the load time of the radix tree */
start_time = GetCurrentTimestamp();
for (int i = 0; i < ntids; i++)
{
@@ -327,7 +333,7 @@ bench_load_random_int(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
pg_prng_seed(&state, 0);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
@@ -393,7 +399,7 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
}
elog(NOTICE, "bench with filter 0x%lX", filter);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
for (uint64 i = 0; i < cnt; i++)
{
@@ -462,7 +468,7 @@ bench_fixed_height_search(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
@@ -574,7 +580,7 @@ bench_node128_load(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
key_id = 0;
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bff37a2c35..b890c38b1a 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -22,6 +22,15 @@
* choose it to avoid an additional pointer traversal. It is the reason this code
* currently does not support variable-length keys.
*
+ * If a DSA area is specified for rt_create(), the radix tree is created in that
+ * DSA area so that multiple processes can access it simultaneously. The process
+ * that created the shared radix tree needs to pass both the DSA area given to
+ * rt_create() and the tree's handle, fetched by rt_get_handle(), to other
+ * processes so that they can attach to it with rt_attach().
+ *
+ * XXX: the shared radix tree is still in a PoC state as it doesn't have any
+ * locking support. Also, only one process at a time can iterate over it.
+ *
* XXX: Most functions in this file have two variants for inner nodes and leaf
* nodes, therefore there are duplication codes. While this sometimes makes the
* code maintenance tricky, this reduces branch prediction misses when judging
@@ -34,6 +43,9 @@
*
* rt_create - Create a new, empty radix tree
* rt_free - Free the radix tree
+ * rt_attach - Attach to the radix tree
+ * rt_detach - Detach from the radix tree
+ * rt_get_handle - Return the handle of the radix tree
* rt_search - Search a key-value pair
* rt_set - Set a key-value pair
* rt_delete - Delete a key-value pair
@@ -65,6 +77,7 @@
#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
#ifdef RT_DEBUG
@@ -426,6 +439,10 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: We need either a safeguard that disallows other processes from beginning
+ * the iteration while one process is doing so, or support for multiple
+ * processes iterating concurrently.
*/
typedef struct rt_node_iter
{
@@ -445,23 +462,43 @@ struct rt_iter
uint64 key;
};
-/* A radix tree with nodes */
-struct radix_tree
+/* A magic value used to identify our radix tree */
+#define RADIXTREE_MAGIC 0x54A48167
+
+/* Control information for a radix tree */
+typedef struct radix_tree_control
{
- MemoryContext context;
+ rt_handle handle;
+ uint32 magic;
+ /* Root node */
rt_pointer root;
+
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
- MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
-
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
+} radix_tree_control;
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ /* control object in either backend-local memory or DSA */
+ radix_tree_control *ctl;
+
+ /* used only when the radix tree is shared */
+ dsa_area *area;
+
+ /* used only when the radix tree is private */
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
};
+#define RadixTreeIsShared(rt) ((rt)->area != NULL)
static void rt_new_root(radix_tree *tree, uint64 key);
@@ -490,9 +527,12 @@ static void rt_verify_node(rt_node_ptr node);
/* Decode and encode functions of rt_pointer */
static inline rt_node *
-rt_pointer_decode(rt_pointer encoded)
+rt_pointer_decode(radix_tree *tree, rt_pointer encoded)
{
- return (rt_node *) encoded;
+ if (RadixTreeIsShared(tree))
+ return (rt_node *) dsa_get_address(tree->area, encoded);
+ else
+ return (rt_node *) encoded;
}
static inline rt_pointer
@@ -503,11 +543,11 @@ rt_pointer_encode(rt_node *decoded)
/* Return a rt_node_ptr created from the given encoded pointer */
static inline rt_node_ptr
-rt_node_ptr_encoded(rt_pointer encoded)
+rt_node_ptr_encoded(radix_tree *tree, rt_pointer encoded)
{
return (rt_node_ptr) {
.encoded = encoded,
- .decoded = rt_pointer_decode(encoded),
+ .decoded = rt_pointer_decode(tree, encoded)
};
}
@@ -954,8 +994,8 @@ rt_new_root(radix_tree *tree, uint64 key)
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
NODE_SHIFT(newnode) = shift;
- tree->max_val = shift_get_max_val(shift);
- tree->root = newnode.encoded;
+ tree->ctl->max_val = shift_get_max_val(shift);
+ tree->ctl->root = newnode.encoded;
}
/*
@@ -964,20 +1004,35 @@ rt_new_root(radix_tree *tree, uint64 key)
static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node_ptr newnode;
+ rt_node_ptr newnode;
- if (inner)
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
- else
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ if (tree->area != NULL)
+ {
+ dsa_pointer dp;
- newnode.encoded = rt_pointer_encode(newnode.decoded);
+ if (inner)
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].inner_size);
+ else
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = (rt_pointer) dp;
+ newnode.decoded = rt_pointer_decode(tree, newnode.encoded);
+ }
+ else
+ {
+ if (inner)
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
+ }
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[size_class]++;
+ tree->ctl->cnt[size_class]++;
#endif
return newnode;
@@ -1041,10 +1096,10 @@ static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node.encoded)
+ if (tree->ctl->root == node.encoded)
{
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
+ tree->ctl->root = InvalidRTPointer;
+ tree->ctl->max_val = 0;
}
#ifdef RT_DEBUG
@@ -1062,12 +1117,15 @@ rt_free_node(radix_tree *tree, rt_node_ptr node)
if (i == RT_SIZE_CLASS_COUNT)
i = RT_CLASS_256;
- tree->cnt[i]--;
- Assert(tree->cnt[i] >= 0);
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
}
#endif
- pfree(node.decoded);
+ if (RadixTreeIsShared(tree))
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ else
+ pfree(node.decoded);
}
/*
@@ -1083,7 +1141,7 @@ rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child.encoded;
+ tree->ctl->root = new_child.encoded;
}
else
{
@@ -1105,7 +1163,7 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
@@ -1123,15 +1181,15 @@ rt_extend(radix_tree *tree, uint64 key)
n4->base.n.shift = shift;
n4->base.n.count = 1;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
root->chunk = 0;
- tree->root = node.encoded;
+ tree->ctl->root = node.encoded;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ tree->ctl->max_val = shift_get_max_val(target_shift);
}
/*
@@ -1163,7 +1221,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
}
rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
+ tree->ctl->num_keys++;
}
/*
@@ -1174,12 +1232,11 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
- rt_pointer *child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action, rt_pointer *child_p)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_pointer child;
+ rt_pointer child = InvalidRTPointer;
switch (NODE_KIND(node))
{
@@ -1210,6 +1267,7 @@ rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
break;
found = true;
+
if (action == RT_ACTION_FIND)
child = n32->children[idx];
else /* RT_ACTION_DELETE */
@@ -1761,33 +1819,51 @@ retry_insert_leaf_32:
* Create the radix tree in the given memory context and return it.
*/
radix_tree *
-rt_create(MemoryContext ctx)
+rt_create(MemoryContext ctx, dsa_area *area)
{
radix_tree *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->context = ctx;
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+
+ tree->area = area;
+ dp = dsa_allocate0(area, sizeof(radix_tree_control));
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
+ tree->ctl->handle = (rt_handle) dp;
+ }
+ else
+ {
+ tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
+ tree->ctl->handle = InvalidDsaPointer;
+ }
+
+ tree->ctl->magic = RADIXTREE_MAGIC;
+ tree->ctl->root = InvalidRTPointer;
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ if (area == NULL)
{
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
- rt_size_class_info[i].leaf_size);
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
- tree->cnt[i] = 0;
+ tree->ctl->cnt[i] = 0;
#endif
+ }
}
MemoryContextSwitchTo(old_ctx);
@@ -1795,16 +1871,163 @@ rt_create(MemoryContext ctx)
return tree;
}
+/*
+ * Get a handle that can be used by other processes to attach to this radix
+ * tree.
+ */
+dsa_pointer
+rt_get_handle(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree->ctl->handle;
+}
+
+/*
+ * Attach to an existing radix tree using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+radix_tree *
+rt_attach(dsa_area *area, rt_handle handle)
+{
+ radix_tree *tree;
+ dsa_pointer control;
+
+ /* Allocate the backend-local object representing the radix tree */
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
+
+	/* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the local radix tree */
+ tree->area = area;
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, control);
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree;
+}
+
+/*
+ * Detach from a radix tree. This frees backend-local resources associated
+ * with the radix tree, but the radix tree will continue to exist until
+ * it is explicitly freed.
+ */
+void
+rt_detach(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ pfree(tree);
+}
+
+/*
+ * Recursively free all nodes allocated in the DSA area.
+ */
+static void
+rt_free_recurse(radix_tree *tree, rt_pointer ptr)
+{
+ rt_node_ptr node = rt_node_ptr_encoded(tree, ptr);
+
+ Assert(RadixTreeIsShared(tree));
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers, so free it */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ return;
+ }
+
+ switch (NODE_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_125_get_child(n125, i));
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_256_get_child(n256, i));
+ }
+ break;
+ }
+ }
+
+ /* Free the inner node itself */
+ dsa_free(tree->area, node.encoded);
+}
+
/*
* Free the given radix tree.
*/
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
{
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
+ /* Free all memory used for radix tree nodes */
+ if (RTPointerIsValid(tree->ctl->root))
+ rt_free_recurse(tree, tree->ctl->root);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->area, tree->ctl->handle);
+ }
+ else
+ {
+ /* Free all memory used for radix tree nodes */
+	for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+ pfree(tree->ctl);
}
pfree(tree);
@@ -1822,16 +2045,18 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
rt_node_ptr node;
rt_node_ptr parent;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree, create the root */
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
rt_extend(tree, key);
/* Descend the tree until a leaf node */
- node = parent = rt_node_ptr_encoded(tree->root);
+ node = parent = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
@@ -1847,7 +2072,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1855,7 +2080,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ tree->ctl->num_keys++;
return updated;
}
@@ -1871,12 +2096,13 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
rt_node_ptr node;
int shift;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
Assert(value_p != NULL);
- if (!RTPointerIsValid(tree->root) || key > tree->max_val)
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
@@ -1890,7 +2116,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1910,14 +2136,16 @@ rt_delete(radix_tree *tree, uint64 key)
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
@@ -1930,7 +2158,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1945,7 +2173,7 @@ rt_delete(radix_tree *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ tree->ctl->num_keys--;
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1985,16 +2213,18 @@ rt_begin_iterate(radix_tree *tree)
rt_iter *iter;
int top_level;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
old_ctx = MemoryContextSwitchTo(tree->context);
iter = (rt_iter *) palloc0(sizeof(rt_iter));
iter->tree = tree;
/* empty tree */
- if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->ctl->root))
return iter;
- root = rt_node_ptr_encoded(iter->tree->root);
+ root = rt_node_ptr_encoded(tree, iter->tree->ctl->root);
top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
@@ -2045,8 +2275,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
bool
rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
{
+ Assert(!RadixTreeIsShared(iter->tree) || iter->tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return false;
for (;;)
@@ -2190,7 +2422,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *
if (found)
{
rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
- *child_p = rt_node_ptr_encoded(child);
+ *child_p = rt_node_ptr_encoded(iter->tree, child);
}
return found;
@@ -2293,7 +2525,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_
uint64
rt_num_entries(radix_tree *tree)
{
- return tree->num_keys;
+ return tree->ctl->num_keys;
}
/*
@@ -2302,12 +2534,19 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
+ total = dsa_get_total_size(tree->area);
+ else
{
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+	for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
}
return total;
@@ -2391,23 +2630,23 @@ rt_verify_node(rt_node_ptr node)
void
rt_stats(radix_tree *tree)
{
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
if (root == NULL)
return;
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
+ tree->ctl->num_keys,
root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node_ptr node, int level, bool recurse)
+rt_dump_node(radix_tree *tree, rt_node_ptr node, int level, bool recurse)
{
rt_node *n = node.decoded;
char space[128] = {0};
@@ -2445,7 +2684,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n4->children[i]),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2473,7 +2712,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n32->children[i]),
level + 1, recurse);
}
else
@@ -2526,7 +2765,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2559,7 +2800,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_256_get_child(n256, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2579,28 +2822,28 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
- tree->max_val, tree->max_val);
+ tree->ctl->max_val, tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
{
elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
rt_pointer child;
- rt_dump_node(node, level, false);
+ rt_dump_node(tree, node, level, false);
if (NODE_IS_LEAF(node))
{
@@ -2615,7 +2858,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2633,15 +2876,15 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].inner_blocksize,
rt_size_class_info[i].leaf_size,
rt_size_class_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- root = rt_node_ptr_encoded(tree->root);
- rt_dump_node(root, 0, true);
+ root = rt_node_ptr_encoded(tree, tree->ctl->root);
+ rt_dump_node(tree, root, 0, true);
}
#endif
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 82376fde2d..ad169882af 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..68a11df970 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -14,18 +14,24 @@
#define RADIXTREE_H
#include "postgres.h"
+#include "utils/dsa.h"
#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
+typedef dsa_pointer rt_handle;
-extern radix_tree *rt_create(MemoryContext ctx);
+extern radix_tree *rt_create(MemoryContext ctx, dsa_area *dsa);
extern void rt_free(radix_tree *tree);
extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern rt_handle rt_get_handle(radix_tree *tree);
+extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+extern void rt_detach(radix_tree *tree);
+
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
extern bool rt_delete(radix_tree *tree, uint64 key);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 405606fe2f..dad06adecc 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index ce645cb8b5..a217e0d312 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -6,28 +6,53 @@ CREATE EXTENSION test_radixtree;
SELECT test_radixtree();
NOTICE: testing basic operations with leaf node 4
NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 125
NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
NOTICE: testing basic operations with leaf node 256
NOTICE: testing basic operations with inner node 256
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree node types with shift "56"
NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "sparse"
NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
test_radixtree
----------------
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index ea993e63df..fe1e168ec4 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -19,6 +19,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -99,6 +100,8 @@ static const test_spec test_specs[] = {
}
};
+static int lwlock_tranche_id;
+
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(test_radixtree);
@@ -112,7 +115,7 @@ test_empty(void)
uint64 key;
uint64 val;
- radixtree = rt_create(CurrentMemoryContext);
+ radixtree = rt_create(CurrentMemoryContext, NULL);
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -140,17 +143,14 @@ test_empty(void)
}
static void
-test_basic(int children, bool test_inner)
+do_test_basic(radix_tree *radixtree, int children, bool test_inner)
{
- radix_tree *radixtree;
uint64 *keys;
int shift = test_inner ? 8 : 0;
elog(NOTICE, "testing basic operations with %s node %d",
test_inner ? "inner" : "leaf", children);
- radixtree = rt_create(CurrentMemoryContext);
-
/* prepare keys in order like 1, 32, 2, 31, 2, ... */
keys = palloc(sizeof(uint64) * children);
for (int i = 0; i < children; i++)
@@ -165,7 +165,7 @@ test_basic(int children, bool test_inner)
for (int i = 0; i < children; i++)
{
if (rt_set(radixtree, keys[i], keys[i]))
- elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found %d", keys[i], i);
}
/* update keys */
@@ -185,7 +185,38 @@ test_basic(int children, bool test_inner)
}
pfree(keys);
- rt_free(radixtree);
+}
+
+static void
+test_basic(void)
+{
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+ dsa_detach(area);
+
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
@@ -286,14 +317,10 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
* level.
*/
static void
-test_node_types(uint8 shift)
+do_test_node_types(radix_tree *radixtree, uint8 shift)
{
- radix_tree *radixtree;
-
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
- radixtree = rt_create(CurrentMemoryContext);
-
/*
* Insert and search entries for every node type at the 'shift' level,
* then delete all entries to make it empty, and insert and search entries
@@ -302,19 +329,37 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift, true);
test_node_types_delete(radixtree, shift);
test_node_types_insert(radixtree, shift, false);
+}
- rt_free(radixtree);
+static void
+test_node_types(void)
+{
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
* Test with a repeating pattern, defined by the 'spec'.
*/
static void
-test_pattern(const test_spec * spec)
+do_test_pattern(radix_tree *radixtree, const test_spec * spec)
{
- radix_tree *radixtree;
rt_iter *iter;
- MemoryContext radixtree_ctx;
TimestampTz starttime;
TimestampTz endtime;
uint64 n;
@@ -340,18 +385,6 @@ test_pattern(const test_spec * spec)
pattern_values[pattern_num_values++] = i;
}
- /*
- * Allocate the radix tree.
- *
- * Allocate it in a separate memory context, so that we can print its
- * memory usage easily.
- */
- radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
- "radixtree test",
- ALLOCSET_SMALL_SIZES);
- MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
- radixtree = rt_create(radixtree_ctx);
-
/*
* Add values to the set.
*/
@@ -405,8 +438,6 @@ test_pattern(const test_spec * spec)
mem_usage = rt_memory_usage(radixtree);
fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
mem_usage, (double) mem_usage / spec->num_values);
-
- MemoryContextStats(radixtree_ctx);
}
/* Check that rt_num_entries works */
@@ -555,27 +586,57 @@ test_pattern(const test_spec * spec)
if ((nbefore - ndeleted) != nafter)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+}
+
+static void
+test_patterns(void)
+{
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ {
+ radix_tree *tree;
+ MemoryContext radixtree_ctx;
+ dsa_area *area;
+ const test_spec *spec = &test_specs[i];
- MemoryContextDelete(radixtree_ctx);
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+ /* Test the local radix tree */
+ tree = rt_create(radixtree_ctx, NULL);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextReset(radixtree_ctx);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(radixtree_ctx, area);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ dsa_detach(area);
+ MemoryContextDelete(radixtree_ctx);
+ }
}
Datum
test_radixtree(PG_FUNCTION_ARGS)
{
- test_empty();
+ /* Get a new LWLock tranche ID for the shared radix tree tests */
+ lwlock_tranche_id = LWLockNewTrancheId();
- for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
- {
- test_basic(rt_node_kind_fanouts[i], false);
- test_basic(rt_node_kind_fanouts[i], true);
- }
-
- for (int shift = 0; shift <= (64 - 8); shift += 8)
- test_node_types(shift);
+ test_empty();
+ test_basic();
- /* Test different test patterns, with lots of entries */
- for (int i = 0; i < lengthof(test_specs); i++)
- test_pattern(&test_specs[i]);
+ test_node_types();
+ test_patterns();
PG_RETURN_VOID();
}
--
2.38.1
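
For illustration, here is a minimal usage sketch of the shared-memory API added
above (this snippet is not part of the attached patches): a leader backend
creates the radix tree in a DSA area and publishes its handle, and a worker
attaches to the same tree through that handle. It assumes the usual
radixtree.h/dsa.h includes and a valid LWLock tranche id, as in
test_radixtree.c; the "worker_area" variable is hypothetical and stands for
the worker's own mapping of the same DSA area.

    uint64      key = 42;
    uint64      value = 0xFF;
    uint64      val;

    /* Leader: create a shared radix tree in a DSA area and get its handle */
    dsa_area   *area = dsa_create(LWLockNewTrancheId());
    radix_tree *tree = rt_create(CurrentMemoryContext, area);
    rt_handle   handle = rt_get_handle(tree);   /* ship this to workers */

    rt_set(tree, key, value);

    /*
     * Worker: attach to the same tree through the handle.  "worker_area" is
     * hypothetical here; it would normally be obtained with dsa_attach() on
     * the same DSA area.
     */
    radix_tree *wtree = rt_attach(worker_area, handle);

    if (!rt_search(wtree, key, &val))
        elog(ERROR, "key not found in shared radix tree");

    rt_detach(wtree);           /* frees only the backend-local object */

    /* Leader, when done: free the shared tree and detach from the area */
    rt_free(tree);
    dsa_detach(area);
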
Attachment: v13-0006-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch (text/x-patch)
From 4dceebdffb8a03e8863d640d25c2d197ef8c16b7 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 14 Nov 2022 11:44:17 +0900
Subject: [PATCH v13 6/8] Use rt_node_ptr to reference radix tree nodes.
---
src/backend/lib/radixtree.c | 688 +++++++++++++++++++++---------------
1 file changed, 398 insertions(+), 290 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index abd0450727..bff37a2c35 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -150,6 +150,19 @@ typedef enum rt_size_class
#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
} rt_size_class;
+/*
+ * rt_pointer is a pointer type that can hold either a pointer to local
+ * memory or a pointer into a DSA area (i.e. dsa_pointer). Since radix tree
+ * nodes can be allocated in backend-local memory as well as in a DSA area,
+ * inner nodes cannot store plain C pointers to rt_node (i.e. backend-local
+ * addresses) as child pointers; they must use rt_pointer instead. The
+ * backend-local address of a node can be obtained from an rt_pointer with
+ * rt_pointer_decode().
+ */
+typedef uintptr_t rt_pointer;
+#define InvalidRTPointer ((rt_pointer) 0)
+#define RTPointerIsValid(x) (((rt_pointer) (x)) != InvalidRTPointer)
+
/* Common type for all nodes types */
typedef struct rt_node
{
@@ -175,8 +188,7 @@ typedef struct rt_node
/* Node kind, one per search/set algorithm */
uint8 kind;
} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define RT_NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
@@ -240,7 +252,7 @@ typedef struct rt_node_inner_4
rt_node_base_4 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
@@ -256,7 +268,7 @@ typedef struct rt_node_inner_32
rt_node_base_32 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
@@ -272,7 +284,7 @@ typedef struct rt_node_inner_125
rt_node_base_125 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_125;
typedef struct rt_node_leaf_125
@@ -292,7 +304,7 @@ typedef struct rt_node_inner_256
rt_node_base_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
+ rt_pointer children[RT_NODE_MAX_SLOTS];
} rt_node_inner_256;
typedef struct rt_node_leaf_256
@@ -306,6 +318,29 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
+/* rt_node_ptr is a data structure representing a pointer to an rt_node */
+typedef struct rt_node_ptr
+{
+ rt_pointer encoded;
+ rt_node *decoded;
+} rt_node_ptr;
+#define InvalidRTNodePtr \
+ (rt_node_ptr) {.encoded = InvalidRTPointer, .decoded = NULL}
+#define RTNodePtrIsValid(n) \
+ (!rt_node_ptr_eq((rt_node_ptr *) &(n), &(InvalidRTNodePtr)))
+
+/* Macros for rt_node_ptr to access the fields of rt_node */
+#define NODE_RAW(n) (n.decoded)
+#define NODE_IS_LEAF(n) (NODE_RAW(n)->shift == 0)
+#define NODE_IS_EMPTY(n) (NODE_COUNT(n) == 0)
+#define NODE_KIND(n) (NODE_RAW(n)->kind)
+#define NODE_COUNT(n) (NODE_RAW(n)->count)
+#define NODE_SHIFT(n) (NODE_RAW(n)->shift)
+#define NODE_CHUNK(n) (NODE_RAW(n)->chunk)
+#define NODE_FANOUT(n) (NODE_RAW(n)->fanout)
+#define NODE_HAS_FREE_SLOT(n) \
+ (NODE_COUNT(n) < rt_node_kind_info[NODE_KIND(n)].fanout)
+
/* Information for each size class */
typedef struct rt_size_class_elem
{
@@ -394,7 +429,7 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
*/
typedef struct rt_node_iter
{
- rt_node *node; /* current node being iterated */
+ rt_node_ptr node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
} rt_node_iter;
@@ -415,7 +450,7 @@ struct radix_tree
{
MemoryContext context;
- rt_node *root;
+ rt_pointer root;
uint64 max_val;
uint64 num_keys;
@@ -429,27 +464,58 @@ struct radix_tree
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
-static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+
+static rt_node_ptr rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class,
bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_free_node(radix_tree *tree, rt_node_ptr node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+static inline bool rt_node_search_inner(rt_node_ptr node_ptr, uint64 key, rt_action action,
+ rt_pointer *child_p);
+static inline bool rt_node_search_leaf(rt_node_ptr node_ptr, uint64 key, rt_action action,
uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+static bool rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node_ptr *child_p);
static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static void rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from);
static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
+static void rt_verify_node(rt_node_ptr node);
+
+/* Decode and encode functions of rt_pointer */
+static inline rt_node *
+rt_pointer_decode(rt_pointer encoded)
+{
+ return (rt_node *) encoded;
+}
+
+static inline rt_pointer
+rt_pointer_encode(rt_node *decoded)
+{
+ return (rt_pointer) decoded;
+}
+
+/* Return a rt_node_ptr created from the given encoded pointer */
+static inline rt_node_ptr
+rt_node_ptr_encoded(rt_pointer encoded)
+{
+ return (rt_node_ptr) {
+ .encoded = encoded,
+ .decoded = rt_pointer_decode(encoded),
+ };
+}
+
+static inline bool
+rt_node_ptr_eq(rt_node_ptr *a, rt_node_ptr *b)
+{
+ return (a->decoded == b->decoded) && (a->encoded == b->encoded);
+}
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
@@ -598,10 +664,10 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_shift(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_pointer) * (count - idx));
}
static inline void
@@ -613,10 +679,10 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_delete(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_pointer) * (count - idx - 1));
}
static inline void
@@ -628,12 +694,12 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children)
+chunk_children_array_copy(uint8 *src_chunks, rt_pointer *src_children,
+ uint8 *dst_chunks, rt_pointer *dst_children)
{
const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size children_size = sizeof(rt_node *) * fanout;
+ const Size children_size = sizeof(rt_pointer) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_children, src_children, children_size);
@@ -665,7 +731,7 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
@@ -673,23 +739,23 @@ node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
static inline bool
node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
#endif
-static inline rt_node *
+static inline rt_pointer
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline uint64
node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -699,9 +765,9 @@ node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
- node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->children[node->base.slot_idxs[chunk]] = InvalidRTPointer;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -710,7 +776,7 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -742,11 +808,11 @@ node_125_find_unused_slot(bitmapword *isset)
}
static inline void
-node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
int slotpos;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -761,7 +827,7 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -772,16 +838,16 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
/* Update the child corresponding to 'chunk' to 'child' */
static inline void
-node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[node->base.slot_idxs[chunk]] = child;
}
static inline void
node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->values[node->base.slot_idxs[chunk]] = value;
}
@@ -791,21 +857,21 @@ node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
static inline bool
node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[chunk]);
}
static inline bool
node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
}
-static inline rt_node *
+static inline rt_pointer
node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(node_inner_256_is_chunk_used(node, chunk));
return node->children[chunk];
}
@@ -813,16 +879,16 @@ node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
static inline uint64
node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(node_leaf_256_is_chunk_used(node, chunk));
return node->values[chunk];
}
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -830,7 +896,7 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
static inline void
node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
node->values[chunk] = value;
}
@@ -839,14 +905,14 @@ node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
static inline void
node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = InvalidRTPointer;
}
static inline void
node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
}
@@ -882,29 +948,32 @@ rt_new_root(radix_tree *tree, uint64 key)
{
int shift = key_get_shift(key);
bool inner = shift > 0;
- rt_node *newnode;
+ rt_node_ptr newnode;
newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newnode->shift = shift;
+ NODE_SHIFT(newnode) = shift;
+
tree->max_val = shift_get_max_val(shift);
- tree->root = newnode;
+ tree->root = newnode.encoded;
}
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
+static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
else
- newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
#ifdef RT_DEBUG
/* update the statistics */
@@ -916,20 +985,20 @@ rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
/* Initialize the node contents */
static inline void
-rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class, bool inner)
{
if (inner)
- MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].inner_size);
else
- MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].leaf_size);
- node->kind = kind;
- node->fanout = rt_size_class_info[size_class].fanout;
+ NODE_KIND(node) = kind;
+ NODE_FANOUT(node) = rt_size_class_info[size_class].fanout;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_125)
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
}
@@ -939,25 +1008,25 @@ rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
* and this is the max size class to it will never grow.
*/
if (kind == RT_NODE_KIND_256)
- node->fanout = 0;
+ NODE_FANOUT(node) = 0;
}
static inline void
-rt_copy_node(rt_node *newnode, rt_node *oldnode)
+rt_copy_node(rt_node_ptr newnode, rt_node_ptr oldnode)
{
- newnode->shift = oldnode->shift;
- newnode->chunk = oldnode->chunk;
- newnode->count = oldnode->count;
+ NODE_SHIFT(newnode) = NODE_SHIFT(oldnode);
+ NODE_CHUNK(newnode) = NODE_CHUNK(oldnode);
+ NODE_COUNT(newnode) = NODE_COUNT(oldnode);
}
/*
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node*
-rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+static rt_node_ptr
+rt_grow_node_kind(radix_tree *tree, rt_node_ptr node, uint8 new_kind)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
bool inner = !NODE_IS_LEAF(node);
newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
@@ -969,12 +1038,12 @@ rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
+ if (tree->root == node.encoded)
{
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
}
@@ -985,7 +1054,7 @@ rt_free_node(radix_tree *tree, rt_node *node)
/* update the statistics */
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
- if (node->fanout == rt_size_class_info[i].fanout)
+ if (NODE_FANOUT(node) == rt_size_class_info[i].fanout)
break;
}
@@ -998,29 +1067,30 @@ rt_free_node(radix_tree *tree, rt_node *node)
}
#endif
- pfree(node);
+ pfree(node.decoded);
}
/*
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
+ rt_node_ptr new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ Assert(NODE_CHUNK(old_child) == NODE_CHUNK(new_child));
+ Assert(NODE_SHIFT(old_child) == NODE_SHIFT(new_child));
- if (parent == old_child)
+ if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->root = new_child.encoded;
}
else
{
bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ replaced = rt_node_insert_inner(tree, InvalidRTNodePtr, parent, key,
+ new_child);
Assert(replaced);
}
@@ -1035,24 +1105,28 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ rt_node *root = rt_pointer_decode(tree->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- rt_node_inner_4 *node;
+ rt_node_ptr node;
+ rt_node_inner_4 *n4;
+
+ node = rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
- rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node->base.n.shift = shift;
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ n4 = (rt_node_inner_4 *) node.decoded;
+ n4->base.n.shift = shift;
+ n4->base.n.count = 1;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ root->chunk = 0;
+ tree->root = node.encoded;
shift += RT_NODE_SPAN;
}
@@ -1065,21 +1139,22 @@ rt_extend(radix_tree *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
+ rt_node_ptr node)
{
- int shift = node->shift;
+ int shift = NODE_SHIFT(node);
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ rt_node_ptr newchild;
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newchild->shift = newshift;
- newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ NODE_SHIFT(newchild) = newshift;
+ NODE_CHUNK(newchild) = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
+
rt_node_insert_inner(tree, parent, node, key, newchild);
parent = node;
@@ -1099,17 +1174,18 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
+ rt_pointer *child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_node *child = NULL;
+ rt_pointer child;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1127,7 +1203,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1143,7 +1219,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1159,7 +1235,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, chunk))
break;
@@ -1176,7 +1252,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && child_p)
*child_p = child;
@@ -1192,17 +1268,17 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node_ptr node, uint64 key, rt_action action, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
uint64 value = 0;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1220,7 +1296,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1236,7 +1312,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1252,7 +1328,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, chunk))
break;
@@ -1269,7 +1345,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && value_p)
*value_p = value;
@@ -1279,19 +1355,19 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(!NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1299,25 +1375,27 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n4->children[idx] = child;
+ n4->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) new.decoded;
+
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1330,14 +1408,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
+ n4->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1345,45 +1423,52 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n32->children[idx] = child;
+ n32->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new32 = (rt_node_inner_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_inner_32;
}
else
{
+ rt_node_ptr new;
rt_node_inner_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_inner_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
@@ -1398,7 +1483,7 @@ retry_insert_inner_32:
count, insertpos);
n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
+ n32->children[insertpos] = child.encoded;
break;
}
}
@@ -1406,25 +1491,28 @@ retry_insert_inner_32:
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
{
/* found the existing chunk */
chunk_exists = true;
- node_inner_125_update(n125, chunk, child);
+ node_inner_125_update(n125, chunk, child.encoded);
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_inner_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1434,32 +1522,31 @@ retry_insert_inner_32:
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
- node_inner_125_insert(n125, chunk, child);
+ node_inner_125_insert(n125, chunk, child.encoded);
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
- node_inner_256_set(n256, chunk, child);
+ node_inner_256_set(n256, chunk, child.encoded);
break;
}
}
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1472,19 +1559,19 @@ retry_insert_inner_32:
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1498,16 +1585,18 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) new.decoded;
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
- node = (rt_node *) new32;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1527,7 +1616,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1541,45 +1630,51 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new32 = (rt_node_leaf_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_leaf_32;
}
else
{
+ rt_node_ptr new;
rt_node_leaf_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_leaf_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
- key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
{
- retry_insert_leaf_32:
+retry_insert_leaf_32:
{
int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
int count = n32->base.n.count;
@@ -1597,7 +1692,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
@@ -1610,12 +1705,14 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_leaf_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_leaf_256 *) new.decoded;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1625,9 +1722,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1638,7 +1734,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
@@ -1650,7 +1746,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1674,7 +1770,7 @@ rt_create(MemoryContext ctx)
tree = palloc(sizeof(radix_tree));
tree->context = ctx;
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
tree->num_keys = 0;
@@ -1723,26 +1819,23 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- rt_node *node;
- rt_node *parent;
+ rt_node_ptr node;
+ rt_node_ptr parent;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
if (key > tree->max_val)
rt_extend(tree, key);
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = parent = tree->root;
-
/* Descend the tree until a leaf node */
+ node = parent = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1754,7 +1847,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1775,21 +1868,21 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
bool
rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RTPointerIsValid(tree->root) || key > tree->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1797,7 +1890,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1811,8 +1904,8 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
bool
rt_delete(radix_tree *tree, uint64 key)
{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ rt_node_ptr node;
+ rt_node_ptr stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1824,12 +1917,12 @@ rt_delete(radix_tree *tree, uint64 key)
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
{
- rt_node *child;
+ rt_pointer child;
/* Push the current node to the stack */
stack[++level] = node;
@@ -1837,7 +1930,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1888,6 +1981,7 @@ rt_iter *
rt_begin_iterate(radix_tree *tree)
{
MemoryContext old_ctx;
+ rt_node_ptr root;
rt_iter *iter;
int top_level;
@@ -1897,17 +1991,18 @@ rt_begin_iterate(radix_tree *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree->root)
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = rt_node_ptr_encoded(iter->tree->root);
+ top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ rt_update_iter_stack(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1918,14 +2013,15 @@ rt_begin_iterate(radix_tree *tree)
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
{
int level = from;
- rt_node *node = from_node;
+ rt_node_ptr node = from_node;
for (;;)
{
rt_node_iter *node_iter = &(iter->stack[level--]);
+ bool found PG_USED_FOR_ASSERTS_ONLY;
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1935,10 +2031,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
return;
/* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
+ found = rt_node_inner_iterate_next(iter, node_iter, &node);
/* We must find the first children in the node */
- Assert(node);
+ Assert(found);
}
}
@@ -1955,7 +2051,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- rt_node *child = NULL;
+ rt_node_ptr child = InvalidRTNodePtr;
uint64 value;
int level;
bool found;
@@ -1976,14 +2072,12 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
*/
for (level = 1; level <= iter->stack_len; level++)
{
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
-
- if (child)
+ if (rt_node_inner_iterate_next(iter, &(iter->stack[level]), &child))
break;
}
/* the iteration finished */
- if (!child)
+ if (!RTNodePtrIsValid(child))
return false;
/*
@@ -2015,18 +2109,19 @@ rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
* Advance the slot in the inner node. Return the child if exists, otherwise
* null.
*/
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+static inline bool
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *child_p)
{
- rt_node *child = NULL;
+ rt_node_ptr node = node_iter->node;
+ rt_pointer child;
bool found = false;
uint8 key_chunk;
- switch (node_iter->node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2039,7 +2134,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2052,7 +2147,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2072,7 +2167,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2093,9 +2188,12 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
if (found)
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ {
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
+ *child_p = rt_node_ptr_encoded(child);
+ }
- return child;
+ return found;
}
/*
@@ -2103,19 +2201,18 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
* is set to value_p, otherwise return false.
*/
static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_p)
{
- rt_node *node = node_iter->node;
+ rt_node_ptr node = node_iter->node;
bool found = false;
uint64 value;
uint8 key_chunk;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2128,7 +2225,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2141,7 +2238,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2161,7 +2258,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2183,7 +2280,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
if (found)
{
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
*value_p = value;
}
@@ -2220,16 +2317,16 @@ rt_memory_usage(radix_tree *tree)
* Verify the radix tree node.
*/
static void
-rt_verify_node(rt_node *node)
+rt_verify_node(rt_node_ptr node)
{
#ifdef USE_ASSERT_CHECKING
- Assert(node->count >= 0);
+ Assert(NODE_COUNT(node) >= 0);
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node.decoded;
for (int i = 1; i < n4->n.count; i++)
Assert(n4->chunks[i - 1] < n4->chunks[i]);
@@ -2238,7 +2335,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_32:
{
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node.decoded;
for (int i = 1; i < n32->n.count; i++)
Assert(n32->chunks[i - 1] < n32->chunks[i]);
@@ -2247,7 +2344,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2257,10 +2354,10 @@ rt_verify_node(rt_node *node)
/* Check if the corresponding slot is used */
if (NODE_IS_LEAF(node))
- Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) n125,
n125->slot_idxs[i]));
else
- Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) n125,
n125->slot_idxs[i]));
cnt++;
@@ -2273,7 +2370,7 @@ rt_verify_node(rt_node *node)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
@@ -2294,54 +2391,62 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
+ rt_node *root = rt_pointer_decode(tree->root);
+
+ if (root == NULL)
+ return;
+
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->num_keys,
+ root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(rt_node_ptr node, int level, bool recurse)
{
- char space[125] = {0};
+ rt_node *n = node.decoded;
+ char space[128] = {0};
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_125) ? 125 : 256,
- node->fanout == 0 ? 256 : node->fanout,
- node->count, node->shift, node->chunk);
+
+ (n->kind == RT_NODE_KIND_4) ? 4 :
+ (n->kind == RT_NODE_KIND_32) ? 32 :
+ (n->kind == RT_NODE_KIND_125) ? 125 : 256,
+ n->fanout == 0 ? 256 : n->fanout,
+ n->count, n->shift, n->chunk);
if (level > 0)
sprintf(space, "%*c", level * 4, ' ');
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n4->base.chunks[i], n4->values[i]);
}
else
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2350,25 +2455,26 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_32:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n32->base.chunks[i], n32->values[i]);
}
else
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n32->base.chunks[i]);
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2378,7 +2484,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node.decoded;
fprintf(stderr, "slot_idxs ");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2390,7 +2496,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node.decoded;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < WORDNUM(128); i++)
@@ -2420,7 +2526,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_125_get_child(n125, i),
+ rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2434,7 +2540,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, i))
continue;
@@ -2444,7 +2550,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
else
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, i))
continue;
@@ -2453,8 +2559,8 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
+ rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2467,7 +2573,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
void
rt_dump_search(radix_tree *tree, uint64 key)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
int level = 0;
@@ -2475,7 +2581,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
tree->max_val, tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
elog(NOTICE, "tree is empty");
return;
@@ -2488,11 +2594,11 @@ rt_dump_search(radix_tree *tree, uint64 key)
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
rt_dump_node(node, level, false);
@@ -2509,7 +2615,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2518,6 +2624,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
+ rt_node_ptr root;
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
@@ -2528,12 +2635,13 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ root = rt_node_ptr_encoded(tree->root);
+ rt_dump_node(root, 0, true);
}
#endif
--
2.38.1
On Tue, Dec 6, 2022 at 7:32 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Fri, Dec 2, 2022 at 11:42 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Nov 14, 2022 at 7:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
- Optimize node128 insert.
I've attached a rough start at this. The basic idea is borrowed from our bitmapset nodes, so we can iterate over and operate on word-sized (32- or 64-bit) types at a time, rather than bytes.
Thanks! I think this is a good idea.
To make this easier, I've moved some of the lower-level macros and types from bitmapset.h/.c to pg_bitutils.h. That's probably going to need a separate email thread to resolve the coding style clash this causes, so that can be put off for later.
I started a separate thread [1], and 0002 comes from feedback on that. There is a FIXME about using WORDNUM and BITNUM, at least with that spelling. I'm putting that off to ease rebasing the rest as v13 -- getting some CI testing with 0002 seems like a good idea. There are no other changes yet. Next, I will take a look at templating local vs. shared memory. I might try basing that on the styles of both v12 and v8, and see which one works best with templating.
Thank you so much!
In the meanwhile, I've been working on vacuum integration. There are
two things I'd like to discuss some time:
The first is the minimum of maintenance_work_mem, 1 MB. Since the
initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel
vacuum with radix tree cannot work with the minimum
maintenance_work_mem. We will need to increase it to 4MB or so. Maybe
we can start a new thread for that.
The second is how to limit the size of the radix tree to
maintenance_work_mem. I think that it's tricky to estimate the maximum
number of keys in the radix tree that fit in maintenance_work_mem. The
radix tree size varies depending on the key distribution. The next
idea I considered was how to limit the size when inserting a key. In
order to strictly limit the radix tree size, we would probably have to
change rt_set() so that it bails out and returns false if the radix
tree size is about to exceed the memory limit when we allocate a new
node or grow a node kind/class. Ideally, I'd like to control the size
outside of the radix tree (e.g., in TIDStore), since such a check could
introduce overhead to rt_set(), but we probably need to add that logic
inside the radix tree.
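To illustrate the strict approach, a minimal sketch could look like the
following (this is not part of the attached patches; the helper name and
the way the limit is passed are made up here):

/*
 * Hypothetical sketch: refuse to allocate a new node once the tree would
 * exceed mem_limit, so that rt_set() can bail out and return false.
 */
static bool
rt_alloc_would_exceed_limit(radix_tree *tree, int size_class, bool inner,
							Size mem_limit)
{
	Size	node_size = inner ? rt_size_class_info[size_class].inner_size
							  : rt_size_class_info[size_class].leaf_size;

	return rt_memory_usage(tree) + node_size > mem_limit;
}

rt_set() would have to check this before every rt_alloc_node() and
rt_grow_node_kind() call and then unwind cleanly, which is what makes
this approach invasive.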
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 9, 2022 at 8:20 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
In the meanwhile, I've been working on vacuum integration. There are
two things I'd like to discuss some time:
The first is the minimum of maintenance_work_mem, 1 MB. Since the
initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel
vacuum with radix tree cannot work with the minimum
maintenance_work_mem. It will need to increase it to 4MB or so. Maybe
we can start a new thread for that.
I don't think that'd be very controversial, but I'm also not sure why we'd
need 4MB -- can you explain in more detail what exactly we'd need so that
the feature would work? (The minimum doesn't have to work *well* IIUC, just
do some useful work and not fail).
The second is how to limit the size of the radix tree to
maintenance_work_mem. I think that it's tricky to estimate the maximum
number of keys in the radix tree that fit in maintenance_work_mem. The
radix tree size varies depending on the key distribution. The next
idea I considered was how to limit the size when inserting a key. In
order to strictly limit the radix tree size, probably we have to
change the rt_set so that it breaks off and returns false if the radix
tree size is about to exceed the memory limit when we allocate a new
node or grow a node kind/class.
That seems complex, fragile, and wrong scope.
Ideally, I'd like to control the size
outside of radix tree (e.g. TIDStore) since it could introduce
overhead to rt_set() but probably we need to add such logic in radix
tree.
Does the TIDStore have the ability to ask the DSA (or slab context) to see
how big it is? If a new segment has been allocated that brings us to the
limit, we can stop when we discover that fact. In the local case with slab
blocks, it won't be on nice neat boundaries, but we could check if we're
within the largest block size (~64kB) of overflow.
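For the local case, a minimal sketch of that check (the names here are
invented, and I'm assuming MemoryContextMemAllocated() is good enough for
this purpose):

/* stop once we're within one maximum slab block (~64kB) of the limit */
#define MAX_LOCAL_BLOCK_SIZE	(64 * 1024)

static bool
local_store_is_full(MemoryContext rt_context, Size limit_bytes)
{
	return MemoryContextMemAllocated(rt_context, true) >
		limit_bytes - MAX_LOCAL_BLOCK_SIZE;
}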
Remember when we discussed how we might approach parallel pruning? I
envisioned a local array of a few dozen kilobytes to reduce contention on
the tidstore. We could use such an array even for a single worker (always
doing the same thing is simpler anyway). When the array fills up enough so
that the next heap page *could* overflow it: Stop, insert into the store,
and check the store's memory usage before continuing.
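Roughly like this, per worker (all names are hypothetical, just to show
the shape of it):

/* hypothetical sketch of the local TID buffer in front of the tidstore */
#define LOCAL_TID_BUF_SIZE	4096	/* a few dozen kB of ItemPointerData */

ItemPointerData buf[LOCAL_TID_BUF_SIZE];
int			nbuf = 0;

/* ... while pruning each heap page ... */
if (nbuf + MaxHeapTuplesPerPage > LOCAL_TID_BUF_SIZE)
{
	tidstore_add_tids(store, buf, nbuf);		/* assumed TIDStore API */
	nbuf = 0;

	if (tidstore_memory_usage(store) > limit)	/* assumed API */
		break;		/* suspend the heap scan; do index/heap vacuum */
}
/* ... append this page's dead TIDs to buf[] ... */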
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Fri, Dec 9, 2022 at 8:20 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
In the meanwhile, I've been working on vacuum integration. There are
two things I'd like to discuss some time:
The first is the minimum of maintenance_work_mem, 1 MB. Since the
initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel
vacuum with radix tree cannot work with the minimum
maintenance_work_mem. It will need to increase it to 4MB or so. Maybe
we can start a new thread for that.
I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
The minimum requirement is 2MB. In the PoC patch, TIDStore checks how big
the radix tree is using dsa_get_total_size(). If the size returned by
dsa_get_total_size() (plus some memory used for TIDStore meta information)
exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
and heap vacuum. However, when allocating DSA memory for
radix_tree_control at creation, we allocate a 1MB
(DSA_INITIAL_SEGMENT_SIZE) DSM segment and take the memory required for
radix_tree_control from it, so dsa_get_total_size() returns 1MB even if
no TIDs have been collected yet.
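For reference, the check in the PoC is essentially the following
(simplified; the struct and field names may still change):

/* simplified from the PoC: decide whether to trigger index/heap vacuum */
static bool
tidstore_exceeds_limit(TidStore *ts)
{
	Size	total = dsa_get_total_size(ts->area) + ts->meta_size;

	return total > (Size) maintenance_work_mem * 1024L;
}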
The second is how to limit the size of the radix tree to
maintenance_work_mem. I think that it's tricky to estimate the maximum
number of keys in the radix tree that fit in maintenance_work_mem. The
radix tree size varies depending on the key distribution. The next
idea I considered was how to limit the size when inserting a key. In
order to strictly limit the radix tree size, probably we have to
change the rt_set so that it breaks off and returns false if the radix
tree size is about to exceed the memory limit when we allocate a new
node or grow a node kind/class.
That seems complex, fragile, and wrong scope.
Ideally, I'd like to control the size
outside of radix tree (e.g. TIDStore) since it could introduce
overhead to rt_set() but probably we need to add such logic in radix
tree.Does the TIDStore have the ability to ask the DSA (or slab context) to see how big it is?
Yes, TIDStore can check it using dsa_get_total_size().
If a new segment has been allocated that brings us to the limit, we can stop when we discover that fact. In the local case with slab blocks, it won't be on nice neat boundaries, but we could check if we're within the largest block size (~64kB) of overflow.
Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
Right, I think it's not a problem in the slab case. In the DSA case, the new
segment size follows a geometric series that approximately doubles the
total storage each time we create a new segment. This behavior comes
from the fact that the underlying DSM system isn't designed for large
numbers of segments.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
I don't think that'd be very controversial, but I'm also not sure why
we'd need 4MB -- can you explain in more detail what exactly we'd need so
that the feature would work? (The minimum doesn't have to work *well* IIUC,
just do some useful work and not fail).
The minimum requirement is 2MB. In PoC patch, TIDStore checks how big
the radix tree is using dsa_get_total_size(). If the size returned by
dsa_get_total_size() (+ some memory used by TIDStore meta information)
exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
and heap vacuum. However, when allocating DSA memory for
radix_tree_control at creation, we allocate 1MB
(DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
radix_tree_control from it. dsa_get_total_size() returns 1MB even if
there is no TID collected.
2MB makes sense.
If the metadata is small, it seems counterproductive to count it towards
the total. We want the decision to be driven by blocks allocated. I have an
idea on that below.
Remember when we discussed how we might approach parallel pruning? I
envisioned a local array of a few dozen kilobytes to reduce contention on
the tidstore. We could use such an array even for a single worker (always
doing the same thing is simpler anyway). When the array fills up enough so
that the next heap page *could* overflow it: Stop, insert into the store,
and check the store's memory usage before continuing.
Right, I think it's no problem in slab cases. In DSA cases, the new
segment size follows a geometric series that approximately doubles the
total storage each time we create a new segment. This behavior comes
from the fact that the underlying DSM system isn't designed for large
numbers of segments.
And taking a look, the size of a new segment can get quite large. It seems
we could test if the total DSA area allocated is greater than half of
maintenance_work_mem. If that parameter is a power of two (common) and
=8MB, then the area will contain just under a power of two the last time
it passes the test. The next segment will bring it to about 3/4 full, like
this:
maintenance work mem = 256MB, so stop if we go over 128MB:
2*(1+2+4+8+16+32) = 126MB -> keep going
126MB + 64 = 190MB -> stop
That would be a simple way to be conservative with the memory limit. The
unfortunate aspect is that the last segment would be mostly wasted, but
it's paradise compared to the pessimistically-sized single array we have
now (even with Peter G.'s VM snapshot informing the allocation size, I
imagine).
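In code the test would be nothing more than this sketch (assuming the
tree's DSA area is at hand, and recalling that maintenance_work_mem is in
kilobytes):

/* sketch: conservative stop condition for the shared (DSA) case */
if (dsa_get_total_size(area) > (Size) maintenance_work_mem * 1024L / 2)
	stop_collecting_dead_tids();	/* hypothetical */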
And as for minimum possible maintenance work mem, I think this would work
with 2MB, if the community is okay with technically going over the limit by
a few bytes of overhead if a buildfarm animal is set to that value. I imagine
it would never go over the limit for realistic (and even most unrealistic)
values. Even with a VM snapshot page in memory and small local arrays of
TIDs, I think with this scheme we'll be well under the limit.
After this feature is complete, I think we should consider a follow-on
patch to get rid of vacuum_work_mem, since it would no longer be needed.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Dec 12, 2022 at 7:14 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
The minimum requirement is 2MB. In PoC patch, TIDStore checks how big
the radix tree is using dsa_get_total_size(). If the size returned by
dsa_get_total_size() (+ some memory used by TIDStore meta information)
exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
and heap vacuum. However, when allocating DSA memory for
radix_tree_control at creation, we allocate 1MB
(DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
radix_tree_control from it. dsa_get_total_size() returns 1MB even if
there is no TID collected.
2MB makes sense.
If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below.
Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
Right, I think it's no problem in slab cases. In DSA cases, the new
segment size follows a geometric series that approximately doubles the
total storage each time we create a new segment. This behavior comes
from the fact that the underlying DSM system isn't designed for large
numbers of segments.
And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this:
maintenance work mem = 256MB, so stop if we go over 128MB:
2*(1+2+4+8+16+32) = 126MB -> keep going
126MB + 64 = 190MB -> stop
That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine).
Right. In this case, even if we allocate 64MB, we will use only 2088
bytes at maximum. So I think the memory space used for vacuum is
practically limited to half.
And as for minimum possible maintenance work mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead if a buildfarm animal set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.
Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
seems that they look only at memory that is actually dsa_allocate'd.
To be exact, we estimate the number of hash buckets based on work_mem
(and hash_mem_multiplier) and use it as the upper limit. So I've
confirmed that the result of dsa_get_total_size() could exceed the
limit. I'm not sure it's a known and legitimate usage. If we can
follow such usage, we can probably track how much dsa_allocate'd
memory is used in the radix tree. Templating whether or not to count
the memory usage might help avoid the overhead.
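Something like the following rough sketch (the field and wrapper names are
made up here; in the shared case the counter would live in
radix_tree_control):

/* rough sketch only: account for dsa_allocate'd bytes ourselves */
static dsa_pointer
rt_dsa_alloc(radix_tree *tree, Size size)
{
	dsa_pointer p = dsa_allocate(tree->dsa, size);

	tree->mem_used += size;		/* rt_memory_usage() would report this */
	return p;
}

static void
rt_dsa_free(radix_tree *tree, dsa_pointer p, Size size)
{
	dsa_free(tree->dsa, p);
	tree->mem_used -= size;
}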
After this feature is complete, I think we should consider a follow-on patch to get rid of vacuum_work_mem, since it would no longer be needed.
I think you meant autovacuum_work_mem. Yes, I also think we can get rid of it.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Dec 12, 2022 at 7:14 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
The minimum requirement is 2MB. In PoC patch, TIDStore checks how big
the radix tree is using dsa_get_total_size(). If the size returned by
dsa_get_total_size() (+ some memory used by TIDStore meta information)
exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
and heap vacuum. However, when allocating DSA memory for
radix_tree_control at creation, we allocate 1MB
(DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
radix_tree_control from it. dsa_get_total_size() returns 1MB even if
there is no TID collected.
2MB makes sense.
If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below.
Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
Right, I think it's no problem in slab cases. In DSA cases, the new
segment size follows a geometric series that approximately doubles the
total storage each time we create a new segment. This behavior comes
from the fact that the underlying DSM system isn't designed for large
numbers of segments.
And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this:
maintenance work mem = 256MB, so stop if we go over 128MB:
2*(1+2+4+8+16+32) = 126MB -> keep going
126MB + 64 = 190MB -> stop
That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine).
Right. In this case, even if we allocate 64MB, we will use only 2088
bytes at maximum. So I think the memory space used for vacuum is
practically limited to half.
And as for minimum possible maintenance work mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead if a buildfarm animal set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.
Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
seems that they look only at memory that is actually dsa_allocate'd.
To be exact, we estimate the number of hash buckets based on work_mem
(and hash_mem_multiplier) and use it as the upper limit. So I've
confirmed that the result of dsa_get_total_size() could exceed the
limit. I'm not sure it's a known and legitimate usage. If we can
follow such usage, we can probably track how much dsa_allocate'd
memory is used in the radix tree.
I've experimented with this idea. The newly added 0008 patch changes
the radix tree so that it counts the memory usage for both local and
shared cases. As shown below, there is an overhead for that:
w/o 0008 patch
=# select * from bench_load_random_int(1000000)
NOTICE: num_keys = 1000000, height = 7, n4 = 4970924, n15 = 38277,
n32 = 27205, n125 = 0, n256 = 257
mem_allocated | load_ms
---------------+---------
298453544 | 282
(1 row)
w/ 0008 patch
=# select * from bench_load_random_int(1000000)
NOTICE: num_keys = 1000000, height = 7, n4 = 4970924, n15 = 38277,
n32 = 27205, n125 = 0, n256 = 257
mem_allocated | load_ms
---------------+---------
293603184 | 297
(1 row)
Although it adds some overhead, I think this idea is straightforward
and the most practical for users. And it seems to be consistent with
other components using DSA. We can improve this part in the future for
better memory control, for example, by introducing slab-like DSA
memory management.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v14-0005-tool-for-measuring-radix-tree-performance.patch (application/octet-stream)
From 75af1182c7107486db3846e616625e456d640e3c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v14 5/9] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 +++
contrib/bench_radix_tree/bench_radix_tree.c | 635 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 767 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..83529805fc
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..a0693695e6
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,635 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
Attachment: v14-0008-PoC-calculate-memory-usage-in-radix-tree.patch (application/octet-stream)
From 8ec7c3f15da739c1a8d78c1eec1e1f45cbe8ba21 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 19 Dec 2022 14:41:43 +0900
Subject: [PATCH v14 8/9] PoC: calculate memory usage in radix tree.
---
src/backend/lib/radixtree.c | 137 +++++++++++++++++++++++------------
src/backend/utils/mmgr/dsa.c | 42 +++++++++++
src/include/utils/dsa.h | 1 +
3 files changed, 135 insertions(+), 45 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 455071cbab..4ad55a0b7c 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -360,14 +360,24 @@ typedef struct rt_size_class_elem
const char *name;
int fanout;
- /* slab chunk size */
+ /* node size */
Size inner_size;
Size leaf_size;
/* slab block size */
- Size inner_blocksize;
- Size leaf_blocksize;
+ Size slab_inner_blocksize;
+ Size slab_leaf_blocksize;
+
+ /*
+ * We can get how much memory is actually allocated for a radix tree node
+ * with GetMemoryChunkSpace() in the local case. However, DSA has no such
+ * facility, so for the shared case we precompute the sizes that DSA
+ * allocates for each node class and use them for memory accounting.
+ */
+ Size dsa_inner_size;
+ Size dsa_leaf_size;
} rt_size_class_elem;
+static bool rt_size_class_dsa_info_initialized = false;
/*
* Calculate the slab blocksize so that we can allocate at least 32 chunks
@@ -381,40 +391,40 @@ static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
.fanout = 4,
.inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
.fanout = 256,
.inner_size = sizeof(rt_node_inner_256),
.leaf_size = sizeof(rt_node_leaf_256),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
},
};
@@ -477,6 +487,12 @@ typedef struct radix_tree_control
uint64 max_val;
uint64 num_keys;
+ /*
+ * Track the amount of memory used. The callers can ask for it
+ * with rt_memory_usage().
+ */
+ uint64 mem_used;
+
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
@@ -1005,15 +1021,22 @@ static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
rt_node_ptr newnode;
+ Size size;
if (RadixTreeIsShared(tree))
{
dsa_pointer dp;
if (inner)
+ {
dp = dsa_allocate(tree->area, rt_size_class_info[size_class].inner_size);
+ size = rt_size_class_info[size_class].dsa_inner_size;
+ }
else
+ {
dp = dsa_allocate(tree->area, rt_size_class_info[size_class].leaf_size);
+ size = rt_size_class_info[size_class].dsa_leaf_size;
+ }
newnode.encoded = (rt_pointer) dp;
newnode.decoded = rt_pointer_decode(tree, newnode.encoded);
@@ -1028,8 +1051,12 @@ rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
rt_size_class_info[size_class].leaf_size);
newnode.encoded = rt_pointer_encode(newnode.decoded);
+ size = GetMemoryChunkSpace(newnode.decoded);
}
+ /* update memory usage */
+ tree->ctl->mem_used += size;
+
#ifdef RT_DEBUG
/* update the statistics */
tree->ctl->cnt[size_class]++;
@@ -1095,6 +1122,15 @@ rt_grow_node_kind(radix_tree *tree, rt_node_ptr node, uint8 new_kind)
static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
+ int size;
+ static const int fanout_node_class[RT_NODE_MAX_SLOTS] =
+ {
+ [4] = RT_CLASS_4_FULL,
+ [15] = RT_CLASS_32_PARTIAL,
+ [32] = RT_CLASS_32_FULL,
+ [125] = RT_CLASS_125_FULL,
+ };
+
/* If we're deleting the root node, make the tree empty */
if (tree->ctl->root == node.encoded)
{
@@ -1104,28 +1140,38 @@ rt_free_node(radix_tree *tree, rt_node_ptr node)
#ifdef RT_DEBUG
{
- int i;
+ int size_class = (NODE_FANOUT(node) == 0)
+ ? RT_CLASS_256
+ : fanout_node_class[NODE_FANOUT(node)];
/* update the statistics */
- for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- if (NODE_FANOUT(node) == rt_size_class_info[i].fanout)
- break;
- }
-
- /* fanout of node256 is intentionally 0 */
- if (i == RT_SIZE_CLASS_COUNT)
- i = RT_CLASS_256;
-
- tree->ctl->cnt[i]--;
- Assert(tree->ctl->cnt[i] >= 0);
+ tree->ctl->cnt[size_class]--;
+ Assert(tree->ctl->cnt[size_class] >= 0);
}
#endif
if (RadixTreeIsShared(tree))
+ {
+ int size_class = (NODE_FANOUT(node) == 0)
+ ? RT_CLASS_256
+ : fanout_node_class[NODE_FANOUT(node)];
+
+ if (!NODE_IS_LEAF(node))
+ size = rt_size_class_info[size_class].dsa_inner_size;
+ else
+ size = rt_size_class_info[size_class].dsa_leaf_size;
+
dsa_free(tree->area, (dsa_pointer) node.encoded);
+ }
else
+ {
+ size = GetMemoryChunkSpace(node.decoded);
pfree(node.decoded);
+ }
+
+ /* update memory usage */
+ tree->ctl->mem_used -= size;
+ Assert(tree->ctl->mem_used > 0);
}
/*
@@ -1837,15 +1883,18 @@ rt_create(MemoryContext ctx, dsa_area *area)
dp = dsa_allocate0(area, sizeof(radix_tree_control));
tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
tree->ctl->handle = (rt_handle) dp;
+ tree->ctl->mem_used += dsa_get_size_class(sizeof(radix_tree_control));
}
else
{
tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
tree->ctl->handle = InvalidDsaPointer;
+ tree->ctl->mem_used += GetMemoryChunkSpace(tree->ctl);
}
tree->ctl->magic = RADIXTREE_MAGIC;
tree->ctl->root = InvalidRTPointer;
+ tree->ctl->mem_used += GetMemoryChunkSpace(tree);
/* Create the slab allocator for each size class */
if (area == NULL)
@@ -1854,17 +1903,29 @@ rt_create(MemoryContext ctx, dsa_area *area)
{
tree->inner_slabs[i] = SlabContextCreate(ctx,
rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].slab_inner_blocksize,
rt_size_class_info[i].inner_size);
tree->leaf_slabs[i] = SlabContextCreate(ctx,
rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].slab_leaf_blocksize,
rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
tree->ctl->cnt[i] = 0;
#endif
}
}
+ else if (!rt_size_class_dsa_info_initialized)
+ {
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ rt_size_class_info[i].dsa_inner_size =
+ dsa_get_size_class(rt_size_class_info[i].inner_size);
+ rt_size_class_info[i].dsa_leaf_size =
+ dsa_get_size_class(rt_size_class_info[i].leaf_size);
+ }
+
+ rt_size_class_dsa_info_initialized = true;
+ }
MemoryContextSwitchTo(old_ctx);
@@ -2534,22 +2595,8 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
-
Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
-
- if (RadixTreeIsShared(tree))
- total = dsa_get_total_size(tree->area);
- else
- {
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
- {
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
- }
- }
-
- return total;
+ return tree->ctl->mem_used;
}
/*
@@ -2873,9 +2920,9 @@ rt_dump(radix_tree *tree)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
rt_size_class_info[i].name,
rt_size_class_info[i].inner_size,
- rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].slab_inner_blocksize,
rt_size_class_info[i].leaf_size,
- rt_size_class_info[i].leaf_blocksize);
+ rt_size_class_info[i].slab_leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
if (!RTPointerIsValid(tree->ctl->root))
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index ad169882af..e77aea10e2 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1208,6 +1208,48 @@ dsa_minimum_size(void)
return pages * FPM_PAGE_SIZE;
}
+size_t
+dsa_get_size_class(size_t size)
+{
+ uint16 size_class;
+
+ if (size > dsa_size_classes[lengthof(dsa_size_classes) - 1])
+ return size;
+ else if (size < lengthof(dsa_size_class_map) * DSA_SIZE_CLASS_MAP_QUANTUM)
+ {
+ int mapidx;
+
+ /* For smaller sizes we have a lookup table... */
+ mapidx = ((size + DSA_SIZE_CLASS_MAP_QUANTUM - 1) /
+ DSA_SIZE_CLASS_MAP_QUANTUM) - 1;
+ size_class = dsa_size_class_map[mapidx];
+ }
+ else
+ {
+ uint16 min;
+ uint16 max;
+
+ /* ... and for the rest we search by binary chop. */
+ min = dsa_size_class_map[lengthof(dsa_size_class_map) - 1];
+ max = lengthof(dsa_size_classes) - 1;
+
+ while (min < max)
+ {
+ uint16 mid = (min + max) / 2;
+ uint16 class_size = dsa_size_classes[mid];
+
+ if (class_size < size)
+ min = mid + 1;
+ else
+ max = mid;
+ }
+
+ size_class = min;
+ }
+
+ return dsa_size_classes[size_class];
+}
+
/*
* Workhorse function for dsa_create and dsa_create_in_place.
*/
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index dad06adecc..a17c4eb88c 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -118,6 +118,7 @@ extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags)
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
extern size_t dsa_get_total_size(dsa_area *area);
+extern size_t dsa_get_size_class(size_t size);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
--
2.31.1
Attachment: v14-0006-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch (application/octet-stream)
From 7e5fd8a19adb0305f77618231364eacaa2e0a59a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 14 Nov 2022 11:44:17 +0900
Subject: [PATCH v14 6/9] Use rt_node_ptr to reference radix tree nodes.
---
src/backend/lib/radixtree.c | 688 +++++++++++++++++++++---------------
1 file changed, 398 insertions(+), 290 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index abd0450727..bff37a2c35 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -150,6 +150,19 @@ typedef enum rt_size_class
#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
} rt_size_class;
+/*
+ * rt_pointer is a pointer compatible with a pointer to local memory and a
+ * pointer for DSA area (i.e. dsa_pointer). Since the radix tree node can be
+ * allocated in backend local memory as well as DSA area, we cannot use a
+ * C-pointer to rt_node (i.e. backend local memory address) for child pointers
+ * in inner nodes. Inner nodes need to use rt_pointer instead. We can get
+ * the backend local memory address of a node from a rt_pointer by using
+ * rt_pointer_decode().
+*/
+typedef uintptr_t rt_pointer;
+#define InvalidRTPointer ((rt_pointer) 0)
+#define RTPointerIsValid(x) (((rt_pointer) (x)) != InvalidRTPointer)
+
/* Common type for all nodes types */
typedef struct rt_node
{
@@ -175,8 +188,7 @@ typedef struct rt_node
/* Node kind, one per search/set algorithm */
uint8 kind;
} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define RT_NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
@@ -240,7 +252,7 @@ typedef struct rt_node_inner_4
rt_node_base_4 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
@@ -256,7 +268,7 @@ typedef struct rt_node_inner_32
rt_node_base_32 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
@@ -272,7 +284,7 @@ typedef struct rt_node_inner_125
rt_node_base_125 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_125;
typedef struct rt_node_leaf_125
@@ -292,7 +304,7 @@ typedef struct rt_node_inner_256
rt_node_base_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
+ rt_pointer children[RT_NODE_MAX_SLOTS];
} rt_node_inner_256;
typedef struct rt_node_leaf_256
@@ -306,6 +318,29 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
+/* rt_node_ptr is a data structure representing a pointer for a rt_node */
+typedef struct rt_node_ptr
+{
+ rt_pointer encoded;
+ rt_node *decoded;
+} rt_node_ptr;
+#define InvalidRTNodePtr \
+ (rt_node_ptr) {.encoded = InvalidRTPointer, .decoded = NULL}
+#define RTNodePtrIsValid(n) \
+ (!rt_node_ptr_eq((rt_node_ptr *) &(n), &(InvalidRTNodePtr)))
+
+/* Macros for rt_node_ptr to access the fields of rt_node */
+#define NODE_RAW(n) (n.decoded)
+#define NODE_IS_LEAF(n) (NODE_RAW(n)->shift == 0)
+#define NODE_IS_EMPTY(n) (NODE_COUNT(n) == 0)
+#define NODE_KIND(n) (NODE_RAW(n)->kind)
+#define NODE_COUNT(n) (NODE_RAW(n)->count)
+#define NODE_SHIFT(n) (NODE_RAW(n)->shift)
+#define NODE_CHUNK(n) (NODE_RAW(n)->chunk)
+#define NODE_FANOUT(n) (NODE_RAW(n)->fanout)
+#define NODE_HAS_FREE_SLOT(n) \
+ (NODE_COUNT(n) < rt_node_kind_info[NODE_KIND(n)].fanout)
+
/* Information for each size class */
typedef struct rt_size_class_elem
{
@@ -394,7 +429,7 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
*/
typedef struct rt_node_iter
{
- rt_node *node; /* current node being iterated */
+ rt_node_ptr node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
} rt_node_iter;
@@ -415,7 +450,7 @@ struct radix_tree
{
MemoryContext context;
- rt_node *root;
+ rt_pointer root;
uint64 max_val;
uint64 num_keys;
@@ -429,27 +464,58 @@ struct radix_tree
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
-static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+
+static rt_node_ptr rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class,
bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_free_node(radix_tree *tree, rt_node_ptr node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+static inline bool rt_node_search_inner(rt_node_ptr node_ptr, uint64 key, rt_action action,
+ rt_pointer *child_p);
+static inline bool rt_node_search_leaf(rt_node_ptr node_ptr, uint64 key, rt_action action,
uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+static bool rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node_ptr *child_p);
static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static void rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from);
static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
+static void rt_verify_node(rt_node_ptr node);
+
+/* Decode and encode functions of rt_pointer */
+static inline rt_node *
+rt_pointer_decode(rt_pointer encoded)
+{
+ return (rt_node *) encoded;
+}
+
+static inline rt_pointer
+rt_pointer_encode(rt_node *decoded)
+{
+ return (rt_pointer) decoded;
+}
+
+/* Return a rt_node_ptr created from the given encoded pointer */
+static inline rt_node_ptr
+rt_node_ptr_encoded(rt_pointer encoded)
+{
+ return (rt_node_ptr) {
+ .encoded = encoded,
+ .decoded = rt_pointer_decode(encoded),
+ };
+}
+
+static inline bool
+rt_node_ptr_eq(rt_node_ptr *a, rt_node_ptr *b)
+{
+ return (a->decoded == b->decoded) && (a->encoded == b->encoded);
+}
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
@@ -598,10 +664,10 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_shift(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_pointer) * (count - idx));
}
static inline void
@@ -613,10 +679,10 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_delete(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_pointer) * (count - idx - 1));
}
static inline void
@@ -628,12 +694,12 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children)
+chunk_children_array_copy(uint8 *src_chunks, rt_pointer *src_children,
+ uint8 *dst_chunks, rt_pointer *dst_children)
{
const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size children_size = sizeof(rt_node *) * fanout;
+ const Size children_size = sizeof(rt_pointer) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_children, src_children, children_size);
@@ -665,7 +731,7 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
@@ -673,23 +739,23 @@ node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
static inline bool
node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
#endif
-static inline rt_node *
+static inline rt_pointer
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline uint64
node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -699,9 +765,9 @@ node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
- node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->children[node->base.slot_idxs[chunk]] = InvalidRTPointer;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -710,7 +776,7 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -742,11 +808,11 @@ node_125_find_unused_slot(bitmapword *isset)
}
static inline void
-node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
int slotpos;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -761,7 +827,7 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -772,16 +838,16 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
/* Update the child corresponding to 'chunk' to 'child' */
static inline void
-node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[node->base.slot_idxs[chunk]] = child;
}
static inline void
node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->values[node->base.slot_idxs[chunk]] = value;
}
@@ -791,21 +857,21 @@ node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
static inline bool
node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[chunk]);
}
static inline bool
node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
}
-static inline rt_node *
+static inline rt_pointer
node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(node_inner_256_is_chunk_used(node, chunk));
return node->children[chunk];
}
@@ -813,16 +879,16 @@ node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
static inline uint64
node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(node_leaf_256_is_chunk_used(node, chunk));
return node->values[chunk];
}
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -830,7 +896,7 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
static inline void
node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
node->values[chunk] = value;
}
@@ -839,14 +905,14 @@ node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
static inline void
node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = InvalidRTPointer;
}
static inline void
node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
}
@@ -882,29 +948,32 @@ rt_new_root(radix_tree *tree, uint64 key)
{
int shift = key_get_shift(key);
bool inner = shift > 0;
- rt_node *newnode;
+ rt_node_ptr newnode;
newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newnode->shift = shift;
+ NODE_SHIFT(newnode) = shift;
+
tree->max_val = shift_get_max_val(shift);
- tree->root = newnode;
+ tree->root = newnode.encoded;
}
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
+static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
else
- newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
#ifdef RT_DEBUG
/* update the statistics */
@@ -916,20 +985,20 @@ rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
/* Initialize the node contents */
static inline void
-rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class, bool inner)
{
if (inner)
- MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].inner_size);
else
- MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].leaf_size);
- node->kind = kind;
- node->fanout = rt_size_class_info[size_class].fanout;
+ NODE_KIND(node) = kind;
+ NODE_FANOUT(node) = rt_size_class_info[size_class].fanout;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_125)
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
}
@@ -939,25 +1008,25 @@ rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
* and this is the max size class to it will never grow.
*/
if (kind == RT_NODE_KIND_256)
- node->fanout = 0;
+ NODE_FANOUT(node) = 0;
}
static inline void
-rt_copy_node(rt_node *newnode, rt_node *oldnode)
+rt_copy_node(rt_node_ptr newnode, rt_node_ptr oldnode)
{
- newnode->shift = oldnode->shift;
- newnode->chunk = oldnode->chunk;
- newnode->count = oldnode->count;
+ NODE_SHIFT(newnode) = NODE_SHIFT(oldnode);
+ NODE_CHUNK(newnode) = NODE_CHUNK(oldnode);
+ NODE_COUNT(newnode) = NODE_COUNT(oldnode);
}
/*
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node*
-rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+static rt_node_ptr
+rt_grow_node_kind(radix_tree *tree, rt_node_ptr node, uint8 new_kind)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
bool inner = !NODE_IS_LEAF(node);
newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
@@ -969,12 +1038,12 @@ rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
+ if (tree->root == node.encoded)
{
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
}
@@ -985,7 +1054,7 @@ rt_free_node(radix_tree *tree, rt_node *node)
/* update the statistics */
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
- if (node->fanout == rt_size_class_info[i].fanout)
+ if (NODE_FANOUT(node) == rt_size_class_info[i].fanout)
break;
}
@@ -998,29 +1067,30 @@ rt_free_node(radix_tree *tree, rt_node *node)
}
#endif
- pfree(node);
+ pfree(node.decoded);
}
/*
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
+ rt_node_ptr new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ Assert(NODE_CHUNK(old_child) == NODE_CHUNK(new_child));
+ Assert(NODE_SHIFT(old_child) == NODE_SHIFT(new_child));
- if (parent == old_child)
+ if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->root = new_child.encoded;
}
else
{
bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ replaced = rt_node_insert_inner(tree, InvalidRTNodePtr, parent, key,
+ new_child);
Assert(replaced);
}
@@ -1035,24 +1105,28 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ rt_node *root = rt_pointer_decode(tree->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- rt_node_inner_4 *node;
+ rt_node_ptr node;
+ rt_node_inner_4 *n4;
+
+ node = rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
- rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node->base.n.shift = shift;
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ n4 = (rt_node_inner_4 *) node.decoded;
+ n4->base.n.shift = shift;
+ n4->base.n.count = 1;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ root->chunk = 0;
+ tree->root = node.encoded;
shift += RT_NODE_SPAN;
}
@@ -1065,21 +1139,22 @@ rt_extend(radix_tree *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
+ rt_node_ptr node)
{
- int shift = node->shift;
+ int shift = NODE_SHIFT(node);
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ rt_node_ptr newchild;
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newchild->shift = newshift;
- newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ NODE_SHIFT(newchild) = newshift;
+ NODE_CHUNK(newchild) = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
+
rt_node_insert_inner(tree, parent, node, key, newchild);
parent = node;
@@ -1099,17 +1174,18 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
+ rt_pointer *child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_node *child = NULL;
+ rt_pointer child;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1127,7 +1203,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1143,7 +1219,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1159,7 +1235,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, chunk))
break;
@@ -1176,7 +1252,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && child_p)
*child_p = child;
@@ -1192,17 +1268,17 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node_ptr node, uint64 key, rt_action action, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
uint64 value = 0;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1220,7 +1296,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1236,7 +1312,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1252,7 +1328,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, chunk))
break;
@@ -1269,7 +1345,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && value_p)
*value_p = value;
@@ -1279,19 +1355,19 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(!NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1299,25 +1375,27 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n4->children[idx] = child;
+ n4->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) new.decoded;
+
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1330,14 +1408,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
+ n4->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1345,45 +1423,52 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n32->children[idx] = child;
+ n32->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new32 = (rt_node_inner_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_inner_32;
}
else
{
+ rt_node_ptr new;
rt_node_inner_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_inner_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
@@ -1398,7 +1483,7 @@ retry_insert_inner_32:
count, insertpos);
n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
+ n32->children[insertpos] = child.encoded;
break;
}
}
@@ -1406,25 +1491,28 @@ retry_insert_inner_32:
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
{
/* found the existing chunk */
chunk_exists = true;
- node_inner_125_update(n125, chunk, child);
+ node_inner_125_update(n125, chunk, child.encoded);
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_inner_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1434,32 +1522,31 @@ retry_insert_inner_32:
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
- node_inner_125_insert(n125, chunk, child);
+ node_inner_125_insert(n125, chunk, child.encoded);
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
- node_inner_256_set(n256, chunk, child);
+ node_inner_256_set(n256, chunk, child.encoded);
break;
}
}
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1472,19 +1559,19 @@ retry_insert_inner_32:
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1498,16 +1585,18 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) new.decoded;
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
- node = (rt_node *) new32;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1527,7 +1616,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1541,45 +1630,51 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new32 = (rt_node_leaf_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_leaf_32;
}
else
{
+ rt_node_ptr new;
rt_node_leaf_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_leaf_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
- key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
{
- retry_insert_leaf_32:
+retry_insert_leaf_32:
{
int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
int count = n32->base.n.count;
@@ -1597,7 +1692,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
@@ -1610,12 +1705,14 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_leaf_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_leaf_256 *) new.decoded;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1625,9 +1722,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1638,7 +1734,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
@@ -1650,7 +1746,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1674,7 +1770,7 @@ rt_create(MemoryContext ctx)
tree = palloc(sizeof(radix_tree));
tree->context = ctx;
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
tree->num_keys = 0;
@@ -1723,26 +1819,23 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- rt_node *node;
- rt_node *parent;
+ rt_node_ptr node;
+ rt_node_ptr parent;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
if (key > tree->max_val)
rt_extend(tree, key);
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = parent = tree->root;
-
/* Descend the tree until a leaf node */
+ node = parent = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1754,7 +1847,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1775,21 +1868,21 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
bool
rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RTPointerIsValid(tree->root) || key > tree->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1797,7 +1890,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1811,8 +1904,8 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
bool
rt_delete(radix_tree *tree, uint64 key)
{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ rt_node_ptr node;
+ rt_node_ptr stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1824,12 +1917,12 @@ rt_delete(radix_tree *tree, uint64 key)
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
{
- rt_node *child;
+ rt_pointer child;
/* Push the current node to the stack */
stack[++level] = node;
@@ -1837,7 +1930,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1888,6 +1981,7 @@ rt_iter *
rt_begin_iterate(radix_tree *tree)
{
MemoryContext old_ctx;
+ rt_node_ptr root;
rt_iter *iter;
int top_level;
@@ -1897,17 +1991,18 @@ rt_begin_iterate(radix_tree *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree->root)
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = rt_node_ptr_encoded(iter->tree->root);
+ top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ rt_update_iter_stack(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1918,14 +2013,15 @@ rt_begin_iterate(radix_tree *tree)
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
{
int level = from;
- rt_node *node = from_node;
+ rt_node_ptr node = from_node;
for (;;)
{
rt_node_iter *node_iter = &(iter->stack[level--]);
+ bool found PG_USED_FOR_ASSERTS_ONLY;
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1935,10 +2031,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
return;
/* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
+ found = rt_node_inner_iterate_next(iter, node_iter, &node);
/* We must find the first children in the node */
- Assert(node);
+ Assert(found);
}
}
@@ -1955,7 +2051,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- rt_node *child = NULL;
+ rt_node_ptr child = InvalidRTNodePtr;
uint64 value;
int level;
bool found;
@@ -1976,14 +2072,12 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
*/
for (level = 1; level <= iter->stack_len; level++)
{
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
-
- if (child)
+ if (rt_node_inner_iterate_next(iter, &(iter->stack[level]), &child))
break;
}
/* the iteration finished */
- if (!child)
+ if (!RTNodePtrIsValid(child))
return false;
/*
@@ -2015,18 +2109,19 @@ rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
* Advance the slot in the inner node. Return the child if exists, otherwise
* null.
*/
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+static inline bool
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *child_p)
{
- rt_node *child = NULL;
+ rt_node_ptr node = node_iter->node;
+ rt_pointer child;
bool found = false;
uint8 key_chunk;
- switch (node_iter->node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2039,7 +2134,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2052,7 +2147,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2072,7 +2167,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2093,9 +2188,12 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
if (found)
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ {
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
+ *child_p = rt_node_ptr_encoded(child);
+ }
- return child;
+ return found;
}
/*
@@ -2103,19 +2201,18 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
* is set to value_p, otherwise return false.
*/
static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_p)
{
- rt_node *node = node_iter->node;
+ rt_node_ptr node = node_iter->node;
bool found = false;
uint64 value;
uint8 key_chunk;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2128,7 +2225,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2141,7 +2238,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2161,7 +2258,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2183,7 +2280,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
if (found)
{
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
*value_p = value;
}
@@ -2220,16 +2317,16 @@ rt_memory_usage(radix_tree *tree)
* Verify the radix tree node.
*/
static void
-rt_verify_node(rt_node *node)
+rt_verify_node(rt_node_ptr node)
{
#ifdef USE_ASSERT_CHECKING
- Assert(node->count >= 0);
+ Assert(NODE_COUNT(node) >= 0);
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node.decoded;
for (int i = 1; i < n4->n.count; i++)
Assert(n4->chunks[i - 1] < n4->chunks[i]);
@@ -2238,7 +2335,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_32:
{
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node.decoded;
for (int i = 1; i < n32->n.count; i++)
Assert(n32->chunks[i - 1] < n32->chunks[i]);
@@ -2247,7 +2344,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2257,10 +2354,10 @@ rt_verify_node(rt_node *node)
/* Check if the corresponding slot is used */
if (NODE_IS_LEAF(node))
- Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) n125,
n125->slot_idxs[i]));
else
- Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) n125,
n125->slot_idxs[i]));
cnt++;
@@ -2273,7 +2370,7 @@ rt_verify_node(rt_node *node)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
@@ -2294,54 +2391,62 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
+ rt_node *root = rt_pointer_decode(tree->root);
+
+ if (root == NULL)
+ return;
+
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->num_keys,
+ root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(rt_node_ptr node, int level, bool recurse)
{
- char space[125] = {0};
+ rt_node *n = node.decoded;
+ char space[128] = {0};
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_125) ? 125 : 256,
- node->fanout == 0 ? 256 : node->fanout,
- node->count, node->shift, node->chunk);
+
+ (n->kind == RT_NODE_KIND_4) ? 4 :
+ (n->kind == RT_NODE_KIND_32) ? 32 :
+ (n->kind == RT_NODE_KIND_125) ? 125 : 256,
+ n->fanout == 0 ? 256 : n->fanout,
+ n->count, n->shift, n->chunk);
if (level > 0)
sprintf(space, "%*c", level * 4, ' ');
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n4->base.chunks[i], n4->values[i]);
}
else
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2350,25 +2455,26 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_32:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n32->base.chunks[i], n32->values[i]);
}
else
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n32->base.chunks[i]);
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2378,7 +2484,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node.decoded;
fprintf(stderr, "slot_idxs ");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2390,7 +2496,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node.decoded;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < WORDNUM(128); i++)
@@ -2420,7 +2526,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_125_get_child(n125, i),
+ rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2434,7 +2540,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, i))
continue;
@@ -2444,7 +2550,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
else
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, i))
continue;
@@ -2453,8 +2559,8 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
+ rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2467,7 +2573,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
void
rt_dump_search(radix_tree *tree, uint64 key)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
int level = 0;
@@ -2475,7 +2581,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
tree->max_val, tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
elog(NOTICE, "tree is empty");
return;
@@ -2488,11 +2594,11 @@ rt_dump_search(radix_tree *tree, uint64 key)
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
rt_dump_node(node, level, false);
@@ -2509,7 +2615,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2518,6 +2624,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
+ rt_node_ptr root;
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
@@ -2528,12 +2635,13 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ root = rt_node_ptr_encoded(tree->root);
+ rt_dump_node(root, 0, true);
}
#endif
--
2.31.1
Attachment: v14-0009-PoC-lazy-vacuum-integration.patch (application/octet-stream)
From 2431edf71e7e22248af46588f554c47cd169cec7 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 4 Nov 2022 14:14:42 +0900
Subject: [PATCH v14 9/9] PoC: lazy vacuum integration.
The patch includes:
* Introducing a new module, TIDStore, to store TIDs in a radix tree.
* Integrating TIDStore with lazy (parallel) vacuum.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 531 ++++++++++++++++++++++++++
src/backend/access/heap/vacuumlazy.c | 170 +++------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +---
src/backend/commands/vacuumparallel.c | 64 ++--
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/access/tidstore.h | 49 +++
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/rules.out | 4 +-
14 files changed, 696 insertions(+), 236 deletions(-)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index 857beaa32d..76265974b1 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..770c4ab5bf
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,531 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * TID (ItemPointer) storage implementation.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "port/pg_bitutils.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+#include "miscadmin.h"
+
+/* XXX only testing purpose during development, will be removed */
+#define XXX_DEBUG_TID_STORE 1
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of 64-bit
+ * key and 64-bit value. We construct a 64-bit unsigned integer that combines
+ * the block number and the offset number. The lowest 11 bits represent the
+ * offset number, and the next 32 bits are the block number. That is, only 43
+ * bits are used:
+ *
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys help keep the radix tree shallow.
+ *
+ * XXX: If we want to support other table AMs that want to use the full range
+ * of possible offset numbers, we'll need to change this.
+ *
+ * The 64-bit value is a bitmap representation of the lowest 6 bits, and
+ * the remaining 37 bits are used as the key:
+ *
+ * value = bitmap representation of XXXXXX
+ * key = XXXXXYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYuu
+ */
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6 /* log(sizeof(uint64) * BITS_PER_BYTE, 2) */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
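+
+/*
+ * Worked example (editorial illustration, not part of the patch): for the
+ * TID (block 10, offset 70), tid_to_key_off() at the bottom of this file
+ * computes
+ *
+ *   tid_i = 70 | (10 << 11)         = 20550
+ *   off   = tid_i & ((1 << 6) - 1)  = 6    -> bit 6 is set in the value
+ *   key   = tid_i >> 6              = 321
+ *
+ * and KEY_GET_BLKNO(321) = 321 >> 5 = 10 recovers the block number.
+ */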
+
+struct TIDStore
+{
+ /* main storage for TID */
+ radix_tree *tree;
+
+ /* # of tids in TIDStore */
+ int num_tids;
+
+ /* maximum bytes TIDStore can consume */
+ uint64 max_bytes;
+
+ /* DSA area and handle for shared TIDStore */
+ rt_handle handle;
+ dsa_area *area;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ uint64 max_items;
+ ItemPointer itemptrs;
+ uint64 nitems;
+#endif
+};
+
+/* Iterator for TIDStore */
+typedef struct TIDStoreIter
+{
+ TIDStore *ts;
+
+ /* iterator of radix tree */
+ rt_iter *tree_iter;
+
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TIDStoreIterResult result;
+
+#ifdef USE_ASSERT_CHECKING
+ uint64 itemptrs_index;
+ int prev_index;
+#endif
+} TIDStoreIter;
+
+static void tidstore_iter_extract_tids(TIDStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+/*
+ * Comparator routines for use with qsort() and bsearch().
+ */
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+
+static void
+verify_iter_tids(TIDStoreIter *iter)
+{
+ uint64 index = iter->prev_index;
+ TIDStoreIterResult *result = &(iter->result);
+
+ if (iter->ts->itemptrs == NULL)
+ return;
+
+ Assert(index <= iter->ts->nitems);
+
+ for (int i = 0; i < result->num_offsets; i++)
+ {
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, result->blkno);
+ ItemPointerSetOffsetNumber(&tid, result->offsets[i]);
+
+ Assert(ItemPointerEquals(&iter->ts->itemptrs[index++], &tid));
+ }
+
+ iter->prev_index = iter->itemptrs_index;
+}
+
+static void
+dump_itemptrs(TIDStore *ts)
+{
+ StringInfoData buf;
+
+ if (ts->itemptrs == NULL)
+ return;
+
+ initStringInfo(&buf);
+ for (int i = 0; i < ts->nitems; i++)
+ {
+ appendStringInfo(&buf, "(%d,%d) ",
+ ItemPointerGetBlockNumber(&(ts->itemptrs[i])),
+ ItemPointerGetOffsetNumber(&(ts->itemptrs[i])));
+ }
+ elog(WARNING, "--- dump (" UINT64_FORMAT " items) ---", ts->nitems);
+ elog(WARNING, "%s\n", buf.data);
+}
+
+#endif
+
+/*
+ * Create a TIDStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TIDStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TIDStore *ts;
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+ ts->area = area;
+ ts->max_bytes = max_bytes;
+
+ if (area != NULL)
+ ts->handle = rt_get_handle(ts->tree);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+#define MAXDEADITEMS(avail_mem) \
+ (avail_mem / sizeof(ItemPointerData))
+
+ if (area == NULL)
+ {
+ ts->max_items = MAXDEADITEMS(maintenance_work_mem * 1024);
+ ts->itemptrs = (ItemPointer) palloc0(sizeof(ItemPointerData) * ts->max_items);
+ ts->nitems = 0;
+ }
+#endif
+
+ return ts;
+}
+
+/* Attach to the shared TIDStore using a handle */
+TIDStore *
+tidstore_attach(dsa_area *area, rt_handle handle)
+{
+ TIDStore *ts;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ ts = palloc0(sizeof(TIDStore));
+ ts->tree = rt_attach(area, handle);
+
+ return ts;
+}
+
+/*
+ * Detach from a TIDStore. This detaches from radix tree and frees the
+ * backend-local resources.
+ */
+void
+tidstore_detach(TIDStore *ts)
+{
+ rt_detach(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_free(TIDStore *ts)
+{
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ pfree(ts->itemptrs);
+#endif
+
+ rt_free(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_reset(TIDStore *ts)
+{
+ dsa_area *area = ts->area;
+
+ /* Reset the statistics */
+ ts->num_tids = 0;
+
+ /* Free the radix tree */
+ rt_free(ts->tree);
+
+ if (ts->area)
+ dsa_trim(area);
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ ts->nitems = 0;
+#endif
+}
+
+/* Add TIDs to TIDStore */
+void
+tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ ts->num_tids++;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ {
+ if (ts->nitems >= ts->max_items)
+ {
+ ts->max_items *= 2;
+ ts->itemptrs = repalloc(ts->itemptrs, sizeof(ItemPointerData) * ts->max_items);
+ }
+
+ Assert(ts->nitems < ts->max_items);
+ ItemPointerSetBlockNumber(&(ts->itemptrs[ts->nitems]), blkno);
+ ItemPointerSetOffsetNumber(&(ts->itemptrs[ts->nitems]), offsets[i]);
+ ts->nitems++;
+ }
+#endif
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(ts->nitems == ts->num_tids);
+#endif
+}
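+
+/*
+ * Example (editorial sketch, not from the patch): with the encoding above,
+ * a call such as
+ *
+ *   OffsetNumber offs[] = {1, 63, 64};
+ *   tidstore_add_tids(ts, 10, offs, 3);
+ *
+ * issues only two rt_set() calls: key 320 with bits 1 and 63 set, and
+ * key 321 with bit 0 set, since offsets of the same block that fall into
+ * the same 64-offset range share a single key/value pair.
+ */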
+
+/* Return true if the given TID is present in TIDStore */
+bool
+tidstore_lookup_tid(TIDStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ bool found_assert;
+#endif
+
+ key = tid_to_key_off(tid, &off);
+
+ found = rt_search(ts->tree, key, &val);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ found_assert = bsearch((void *) tid,
+ (void *) ts->itemptrs,
+ ts->nitems,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr) != NULL;
+#endif
+
+ if (!found)
+ {
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(!found_assert);
+#endif
+ return false;
+ }
+
+ found = (val & (UINT64CONST(1) << off)) != 0;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+
+ if (ts->itemptrs && found != found_assert)
+ {
+ elog(WARNING, "tid (%d,%d)\n",
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
+ dump_itemptrs(ts);
+ }
+
+ if (ts->itemptrs)
+ Assert(found == found_assert);
+
+#endif
+ return found;
+}
+
+TIDStoreIter *
+tidstore_begin_iterate(TIDStore *ts)
+{
+ TIDStoreIter *iter;
+
+ iter = palloc0(sizeof(TIDStoreIter));
+ iter->ts = ts;
+ iter->tree_iter = rt_begin_iterate(ts->tree);
+ iter->result.blkno = InvalidBlockNumber;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index = 0;
+#endif
+
+ return iter;
+}
+
+TIDStoreIterResult *
+tidstore_iterate_next(TIDStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TIDStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (rt_iterate_next(iter->tree_iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+
+ iter->finished = true;
+ return result;
+}
+
+uint64
+tidstore_num_tids(TIDStore *ts)
+{
+ return ts->num_tids;
+}
+
+bool
+tidstore_is_full(TIDStore *ts)
+{
+ return ((sizeof(TIDStore) + rt_memory_usage(ts->tree)) > ts->max_bytes);
+}
+
+uint64
+tidstore_max_memory(TIDStore *ts)
+{
+ return ts->max_bytes;
+}
+
+uint64
+tidstore_memory_usage(TIDStore *ts)
+{
+ return (uint64) sizeof(TIDStore) + rt_memory_usage(ts->tree);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TIDStore
+ */
+tidstore_handle
+tidstore_get_handle(TIDStore *ts)
+{
+ return rt_get_handle(ts->tree);
+}
+
+/* Extract TIDs from key-value pair */
+static void
+tidstore_iter_extract_tids(TIDStoreIter *iter, uint64 key, uint64 val)
+{
+ TIDStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index++;
+#endif
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a TID to key and val.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d59711b7ec..24c1dc7099 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -194,7 +195,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TIDStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -265,8 +266,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -853,21 +855,21 @@ lazy_scan_heap(LVRelState *vacrel)
next_unskippable_block,
next_failsafe_block = 0,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TIDStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -937,8 +939,8 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ /* XXX: should not allow tidstore to grow beyond max_bytes */
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1070,11 +1072,18 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TIDStoreIter *iter;
+ TIDStoreIterResult *result;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ pfree(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1111,7 +1120,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1264,7 +1273,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1863,25 +1872,16 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
Assert(!prunestate->all_visible);
Assert(prunestate->has_lpdead_items);
vacrel->lpdead_item_pages++;
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/* Finally, add page-local counts to whole-VACUUM counts */
@@ -2088,8 +2088,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2098,17 +2097,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2157,7 +2149,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2186,7 +2178,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2213,8 +2205,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2259,7 +2251,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2331,7 +2323,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2368,10 +2360,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TIDStoreIter *iter;
+ TIDStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2388,8 +2381,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber tblk;
Buffer buf;
@@ -2398,12 +2391,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = result->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2427,14 +2421,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2451,11 +2444,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2474,16 +2466,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2563,7 +2550,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3065,46 +3051,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3115,11 +3061,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3146,7 +3090,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3159,11 +3103,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..bc42144f08 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 293b84bbca..7f5776fbf8 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2276,16 +2275,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TIDStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2316,18 +2315,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2338,60 +2325,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TIDStore *dead_items = (TIDStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26d796e52..429607d5fa 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TIDStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_free(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TIDStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 528b2e9643..ea8cf6283b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -186,6 +186,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1bf14eec66..5d9808977e 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2280,7 +2280,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..4a7ab3f5a8
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * TID storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TIDStore TIDStore;
+typedef struct TIDStoreIter TIDStoreIter;
+
+typedef struct TIDStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually don't use up */
+ int num_offsets;
+} TIDStoreIterResult;
+
+extern TIDStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TIDStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TIDStore *ts);
+extern void tidstore_free(TIDStore *ts);
+extern void tidstore_reset(TIDStore *ts);
+extern void tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TIDStore *ts, ItemPointer tid);
+extern TIDStoreIter * tidstore_begin_iterate(TIDStore *ts);
+extern TIDStoreIterResult *tidstore_iterate_next(TIDStoreIter *iter);
+extern uint64 tidstore_num_tids(TIDStore *ts);
+extern bool tidstore_is_full(TIDStore *ts);
+extern uint64 tidstore_max_memory(TIDStore *ts);
+extern uint64 tidstore_memory_usage(TIDStore *ts);
+extern tidstore_handle tidstore_get_handle(TIDStore *ts);
+
+#endif /* TIDSTORE_H */
+
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index a28938caf4..75d540d315 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4e4bc26a8b..afe61c21fd 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -235,21 +236,6 @@ typedef struct VacuumParams
int nworkers;
} VacuumParams;
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -302,18 +288,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TIDStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TIDStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index dd818e16ab..f1e0bcede5 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -204,6 +204,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..0c49354f04 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT s.stats_reset,
--
2.31.1
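
For reference, here is a minimal usage sketch of the tidstore.h interface added above, assuming only the declarations in that header (the block number, offsets, and memory budget are made up for illustration; error handling and the surrounding vacuum logic are omitted):

/* Sketch only: not code from the patch. */
TIDStore   *dead_items;
OffsetNumber offsets[2] = {1, 3};
ItemPointerData tid;

/* NULL dsa_area makes a backend-local store; pass a dsa_area to share it with workers */
dead_items = tidstore_create((uint64) maintenance_work_mem * 1024, NULL);

/* heap scan phase: remember dead item offsets per block */
tidstore_add_tids(dead_items, (BlockNumber) 42, offsets, 2);

/* index vacuum phase: existence check for each index tuple's heap TID */
ItemPointerSet(&tid, 42, 1);
if (tidstore_lookup_tid(dead_items, &tid))
{
    /* this index tuple points to a dead heap tuple and can be deleted */
}

tidstore_free(dead_items);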
Attachment: v14-0007-PoC-DSA-support-for-radix-tree.patch (application/octet-stream)
From d575b8f8215494d9ac82b256b260acd921de1928 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 16:42:55 +0700
Subject: [PATCH v14 7/9] PoC: DSA support for radix tree
---
.../bench_radix_tree--1.0.sql | 2 +
contrib/bench_radix_tree/bench_radix_tree.c | 16 +-
src/backend/lib/radixtree.c | 437 ++++++++++++++----
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 8 +-
src/include/utils/dsa.h | 1 +
.../expected/test_radixtree.out | 25 +
.../modules/test_radixtree/test_radixtree.c | 147 ++++--
8 files changed, 502 insertions(+), 146 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 83529805fc..d9216d715c 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -7,6 +7,7 @@ create function bench_shuffle_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
@@ -23,6 +24,7 @@ create function bench_seq_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index a0693695e6..1a26722495 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -154,6 +154,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
BlockNumber maxblk = PG_GETARG_INT32(1);
bool random_block = PG_GETARG_BOOL(2);
radix_tree *rt = NULL;
+ bool shared = PG_GETARG_BOOL(3);
+ dsa_area *dsa = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;
@@ -176,7 +178,11 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
/* measure the load time of the radix tree */
- rt = rt_create(CurrentMemoryContext);
+ if (shared)
+ dsa = dsa_create(LWLockNewTrancheId());
+ rt = rt_create(CurrentMemoryContext, dsa);
+
+ /* measure the load time of the radix tree */
start_time = GetCurrentTimestamp();
for (int i = 0; i < ntids; i++)
{
@@ -327,7 +333,7 @@ bench_load_random_int(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
pg_prng_seed(&state, 0);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
@@ -393,7 +399,7 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
}
elog(NOTICE, "bench with filter 0x%lX", filter);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
for (uint64 i = 0; i < cnt; i++)
{
@@ -462,7 +468,7 @@ bench_fixed_height_search(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
@@ -574,7 +580,7 @@ bench_node128_load(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
key_id = 0;
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bff37a2c35..455071cbab 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -22,6 +22,15 @@
* choose it to avoid an additional pointer traversal. It is the reason this code
* currently does not support variable-length keys.
*
+ * If a DSA area is specified for rt_create(), the radix tree is created in the
+ * DSA area so that multiple processes can access it simultaneously. The process
+ * that creates the shared radix tree needs to pass both the DSA area specified
+ * when calling rt_create() and the dsa_pointer of the radix tree, fetched by
+ * rt_get_handle(), to other processes so that they can attach via rt_attach().
+ *
+ * XXX: the shared radix tree is still at the PoC stage as it doesn't have any
+ * locking support. Also, only one process at a time can iterate over it.
+ *
* XXX: Most functions in this file have two variants for inner nodes and leaf
* nodes, therefore there is duplicated code. While this sometimes makes the
* code maintenance tricky, this reduces branch prediction misses when judging
@@ -34,6 +43,9 @@
*
* rt_create - Create a new, empty radix tree
* rt_free - Free the radix tree
+ * rt_attach - Attach to the radix tree
+ * rt_detach - Detach from the radix tree
+ * rt_get_handle - Return the handle of the radix tree
* rt_search - Search a key-value pair
* rt_set - Set a key-value pair
* rt_delete - Delete a key-value pair
@@ -65,6 +77,7 @@
#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
#ifdef RT_DEBUG
@@ -426,6 +439,10 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: We need either a safeguard that disallows other processes from beginning
+ * an iteration while one is in progress, or support for multiple processes
+ * iterating concurrently.
*/
typedef struct rt_node_iter
{
@@ -445,23 +462,43 @@ struct rt_iter
uint64 key;
};
-/* A radix tree with nodes */
-struct radix_tree
+/* A magic value used to identify our radix tree */
+#define RADIXTREE_MAGIC 0x54A48167
+
+/* Control information for a radix tree */
+typedef struct radix_tree_control
{
- MemoryContext context;
+ rt_handle handle;
+ uint32 magic;
+ /* Root node */
rt_pointer root;
+
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
- MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
-
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
+} radix_tree_control;
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ /* control object in either backend-local memory or DSA */
+ radix_tree_control *ctl;
+
+ /* used only when the radix tree is shared */
+ dsa_area *area;
+
+ /* used only when the radix tree is private */
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
};
+#define RadixTreeIsShared(rt) ((rt)->area != NULL)
static void rt_new_root(radix_tree *tree, uint64 key);
@@ -490,9 +527,12 @@ static void rt_verify_node(rt_node_ptr node);
/* Decode and encode functions of rt_pointer */
static inline rt_node *
-rt_pointer_decode(rt_pointer encoded)
+rt_pointer_decode(radix_tree *tree, rt_pointer encoded)
{
- return (rt_node *) encoded;
+ if (RadixTreeIsShared(tree))
+ return (rt_node *) dsa_get_address(tree->area, encoded);
+ else
+ return (rt_node *) encoded;
}
static inline rt_pointer
@@ -503,11 +543,11 @@ rt_pointer_encode(rt_node *decoded)
/* Return a rt_node_ptr created from the given encoded pointer */
static inline rt_node_ptr
-rt_node_ptr_encoded(rt_pointer encoded)
+rt_node_ptr_encoded(radix_tree *tree, rt_pointer encoded)
{
return (rt_node_ptr) {
.encoded = encoded,
- .decoded = rt_pointer_decode(encoded),
+ .decoded = rt_pointer_decode(tree, encoded)
};
}
@@ -954,8 +994,8 @@ rt_new_root(radix_tree *tree, uint64 key)
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
NODE_SHIFT(newnode) = shift;
- tree->max_val = shift_get_max_val(shift);
- tree->root = newnode.encoded;
+ tree->ctl->max_val = shift_get_max_val(shift);
+ tree->ctl->root = newnode.encoded;
}
/*
@@ -964,20 +1004,35 @@ rt_new_root(radix_tree *tree, uint64 key)
static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node_ptr newnode;
+ rt_node_ptr newnode;
- if (inner)
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
- else
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ if (RadixTreeIsShared(tree))
+ {
+ dsa_pointer dp;
- newnode.encoded = rt_pointer_encode(newnode.decoded);
+ if (inner)
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].inner_size);
+ else
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = (rt_pointer) dp;
+ newnode.decoded = rt_pointer_decode(tree, newnode.encoded);
+ }
+ else
+ {
+ if (inner)
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
+ }
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[size_class]++;
+ tree->ctl->cnt[size_class]++;
#endif
return newnode;
@@ -1041,10 +1096,10 @@ static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node.encoded)
+ if (tree->ctl->root == node.encoded)
{
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
+ tree->ctl->root = InvalidRTPointer;
+ tree->ctl->max_val = 0;
}
#ifdef RT_DEBUG
@@ -1062,12 +1117,15 @@ rt_free_node(radix_tree *tree, rt_node_ptr node)
if (i == RT_SIZE_CLASS_COUNT)
i = RT_CLASS_256;
- tree->cnt[i]--;
- Assert(tree->cnt[i] >= 0);
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
}
#endif
- pfree(node.decoded);
+ if (RadixTreeIsShared(tree))
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ else
+ pfree(node.decoded);
}
/*
@@ -1083,7 +1141,7 @@ rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child.encoded;
+ tree->ctl->root = new_child.encoded;
}
else
{
@@ -1105,7 +1163,7 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
@@ -1123,15 +1181,15 @@ rt_extend(radix_tree *tree, uint64 key)
n4->base.n.shift = shift;
n4->base.n.count = 1;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
root->chunk = 0;
- tree->root = node.encoded;
+ tree->ctl->root = node.encoded;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ tree->ctl->max_val = shift_get_max_val(target_shift);
}
/*
@@ -1163,7 +1221,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
}
rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
+ tree->ctl->num_keys++;
}
/*
@@ -1174,12 +1232,11 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
- rt_pointer *child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action, rt_pointer *child_p)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_pointer child;
+ rt_pointer child = InvalidRTPointer;
switch (NODE_KIND(node))
{
@@ -1210,6 +1267,7 @@ rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
break;
found = true;
+
if (action == RT_ACTION_FIND)
child = n32->children[idx];
else /* RT_ACTION_DELETE */
@@ -1761,33 +1819,51 @@ retry_insert_leaf_32:
* Create the radix tree in the given memory context and return it.
*/
radix_tree *
-rt_create(MemoryContext ctx)
+rt_create(MemoryContext ctx, dsa_area *area)
{
radix_tree *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->context = ctx;
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+
+ tree->area = area;
+ dp = dsa_allocate0(area, sizeof(radix_tree_control));
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
+ tree->ctl->handle = (rt_handle) dp;
+ }
+ else
+ {
+ tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
+ tree->ctl->handle = InvalidDsaPointer;
+ }
+
+ tree->ctl->magic = RADIXTREE_MAGIC;
+ tree->ctl->root = InvalidRTPointer;
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ if (area == NULL)
{
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
- rt_size_class_info[i].leaf_size);
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
- tree->cnt[i] = 0;
+ tree->ctl->cnt[i] = 0;
#endif
+ }
}
MemoryContextSwitchTo(old_ctx);
@@ -1795,16 +1871,163 @@ rt_create(MemoryContext ctx)
return tree;
}
+/*
+ * Get a handle that can be used by other processes to attach to this radix
+ * tree.
+ */
+dsa_pointer
+rt_get_handle(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree->ctl->handle;
+}
+
+/*
+ * Attach to an existing radix tree using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+radix_tree *
+rt_attach(dsa_area *area, rt_handle handle)
+{
+ radix_tree *tree;
+ dsa_pointer control;
+
+ /* Allocate the backend-local object representing the radix tree */
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the local radix tree */
+ tree->area = area;
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, control);
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree;
+}
+
+/*
+ * Detach from a radix tree. This frees backend-local resources associated
+ * with the radix tree, but the radix tree will continue to exist until
+ * it is explicitly freed.
+ */
+void
+rt_detach(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ pfree(tree);
+}
+
+/*
+ * Recursively free all nodes allocated in the DSA area.
+ */
+static void
+rt_free_recurse(radix_tree *tree, rt_pointer ptr)
+{
+ rt_node_ptr node = rt_node_ptr_encoded(tree, ptr);
+
+ Assert(RadixTreeIsShared(tree));
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers, so free it */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ return;
+ }
+
+ switch (NODE_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_125_get_child(n125, i));
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_256_get_child(n256, i));
+ }
+ break;
+ }
+ }
+
+ /* Free the inner node itself */
+ dsa_free(tree->area, node.encoded);
+}
+
/*
* Free the given radix tree.
*/
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
{
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
+ /* Free all memory used for radix tree nodes */
+ if (RTPointerIsValid(tree->ctl->root))
+ rt_free_recurse(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->area, tree->ctl->handle);
+ }
+ else
+ {
+ /* Free all memory used for radix tree nodes */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+ pfree(tree->ctl);
}
pfree(tree);
@@ -1822,16 +2045,18 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
rt_node_ptr node;
rt_node_ptr parent;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree, create the root */
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
rt_extend(tree, key);
/* Descend the tree until a leaf node */
- node = parent = rt_node_ptr_encoded(tree->root);
+ node = parent = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
@@ -1847,7 +2072,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1855,7 +2080,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ tree->ctl->num_keys++;
return updated;
}
@@ -1871,12 +2096,13 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
rt_node_ptr node;
int shift;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
Assert(value_p != NULL);
- if (!RTPointerIsValid(tree->root) || key > tree->max_val)
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
@@ -1890,7 +2116,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1910,14 +2136,16 @@ rt_delete(radix_tree *tree, uint64 key)
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
@@ -1930,7 +2158,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1945,7 +2173,7 @@ rt_delete(radix_tree *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ tree->ctl->num_keys--;
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1985,16 +2213,18 @@ rt_begin_iterate(radix_tree *tree)
rt_iter *iter;
int top_level;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
old_ctx = MemoryContextSwitchTo(tree->context);
iter = (rt_iter *) palloc0(sizeof(rt_iter));
iter->tree = tree;
/* empty tree */
- if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->ctl->root))
return iter;
- root = rt_node_ptr_encoded(iter->tree->root);
+ root = rt_node_ptr_encoded(tree, iter->tree->ctl->root);
top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
@@ -2045,8 +2275,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
bool
rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
{
+ Assert(!RadixTreeIsShared(iter->tree) || iter->tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return false;
for (;;)
@@ -2190,7 +2422,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *
if (found)
{
rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
- *child_p = rt_node_ptr_encoded(child);
+ *child_p = rt_node_ptr_encoded(iter->tree, child);
}
return found;
@@ -2293,7 +2525,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_
uint64
rt_num_entries(radix_tree *tree)
{
- return tree->num_keys;
+ return tree->ctl->num_keys;
}
/*
@@ -2302,12 +2534,19 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
+ total = dsa_get_total_size(tree->area);
+ else
{
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
}
return total;
@@ -2391,23 +2630,23 @@ rt_verify_node(rt_node_ptr node)
void
rt_stats(radix_tree *tree)
{
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
if (root == NULL)
return;
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
+ tree->ctl->num_keys,
root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node_ptr node, int level, bool recurse)
+rt_dump_node(radix_tree *tree, rt_node_ptr node, int level, bool recurse)
{
rt_node *n = node.decoded;
char space[128] = {0};
@@ -2445,7 +2684,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n4->children[i]),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2473,7 +2712,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n32->children[i]),
level + 1, recurse);
}
else
@@ -2526,7 +2765,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2559,7 +2800,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_256_get_child(n256, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2579,28 +2822,28 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
- tree->max_val, tree->max_val);
+ tree->ctl->max_val, tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
{
elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
rt_pointer child;
- rt_dump_node(node, level, false);
+ rt_dump_node(tree, node, level, false);
if (NODE_IS_LEAF(node))
{
@@ -2615,7 +2858,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2633,15 +2876,15 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].inner_blocksize,
rt_size_class_info[i].leaf_size,
rt_size_class_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- root = rt_node_ptr_encoded(tree->root);
- rt_dump_node(root, 0, true);
+ root = rt_node_ptr_encoded(tree, tree->ctl->root);
+ rt_dump_node(tree, root, 0, true);
}
#endif
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 82376fde2d..ad169882af 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..68a11df970 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -14,18 +14,24 @@
#define RADIXTREE_H
#include "postgres.h"
+#include "utils/dsa.h"
#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
+typedef dsa_pointer rt_handle;
-extern radix_tree *rt_create(MemoryContext ctx);
+extern radix_tree *rt_create(MemoryContext ctx, dsa_area *dsa);
extern void rt_free(radix_tree *tree);
extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern rt_handle rt_get_handle(radix_tree *tree);
+extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+extern void rt_detach(radix_tree *tree);
+
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
extern bool rt_delete(radix_tree *tree, uint64 key);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 405606fe2f..dad06adecc 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index ce645cb8b5..a217e0d312 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -6,28 +6,53 @@ CREATE EXTENSION test_radixtree;
SELECT test_radixtree();
NOTICE: testing basic operations with leaf node 4
NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 125
NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
NOTICE: testing basic operations with leaf node 256
NOTICE: testing basic operations with inner node 256
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree node types with shift "56"
NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "sparse"
NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
test_radixtree
----------------
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index ea993e63df..fe1e168ec4 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -19,6 +19,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -99,6 +100,8 @@ static const test_spec test_specs[] = {
}
};
+static int lwlock_tranche_id;
+
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(test_radixtree);
@@ -112,7 +115,7 @@ test_empty(void)
uint64 key;
uint64 val;
- radixtree = rt_create(CurrentMemoryContext);
+ radixtree = rt_create(CurrentMemoryContext, NULL);
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -140,17 +143,14 @@ test_empty(void)
}
static void
-test_basic(int children, bool test_inner)
+do_test_basic(radix_tree *radixtree, int children, bool test_inner)
{
- radix_tree *radixtree;
uint64 *keys;
int shift = test_inner ? 8 : 0;
elog(NOTICE, "testing basic operations with %s node %d",
test_inner ? "inner" : "leaf", children);
- radixtree = rt_create(CurrentMemoryContext);
-
/* prepare keys in order like 1, 32, 2, 31, 2, ... */
keys = palloc(sizeof(uint64) * children);
for (int i = 0; i < children; i++)
@@ -165,7 +165,7 @@ test_basic(int children, bool test_inner)
for (int i = 0; i < children; i++)
{
if (rt_set(radixtree, keys[i], keys[i]))
- elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found %d", keys[i], i);
}
/* update keys */
@@ -185,7 +185,38 @@ test_basic(int children, bool test_inner)
}
pfree(keys);
- rt_free(radixtree);
+}
+
+static void
+test_basic()
+{
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+ dsa_detach(area);
+
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
@@ -286,14 +317,10 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
* level.
*/
static void
-test_node_types(uint8 shift)
+do_test_node_types(radix_tree *radixtree, uint8 shift)
{
- radix_tree *radixtree;
-
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
- radixtree = rt_create(CurrentMemoryContext);
-
/*
* Insert and search entries for every node type at the 'shift' level,
* then delete all entries to make it empty, and insert and search entries
@@ -302,19 +329,37 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift, true);
test_node_types_delete(radixtree, shift);
test_node_types_insert(radixtree, shift, false);
+}
- rt_free(radixtree);
+static void
+test_node_types(void)
+{
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
* Test with a repeating pattern, defined by the 'spec'.
*/
static void
-test_pattern(const test_spec * spec)
+do_test_pattern(radix_tree *radixtree, const test_spec * spec)
{
- radix_tree *radixtree;
rt_iter *iter;
- MemoryContext radixtree_ctx;
TimestampTz starttime;
TimestampTz endtime;
uint64 n;
@@ -340,18 +385,6 @@ test_pattern(const test_spec * spec)
pattern_values[pattern_num_values++] = i;
}
- /*
- * Allocate the radix tree.
- *
- * Allocate it in a separate memory context, so that we can print its
- * memory usage easily.
- */
- radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
- "radixtree test",
- ALLOCSET_SMALL_SIZES);
- MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
- radixtree = rt_create(radixtree_ctx);
-
/*
* Add values to the set.
*/
@@ -405,8 +438,6 @@ test_pattern(const test_spec * spec)
mem_usage = rt_memory_usage(radixtree);
fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
mem_usage, (double) mem_usage / spec->num_values);
-
- MemoryContextStats(radixtree_ctx);
}
/* Check that rt_num_entries works */
@@ -555,27 +586,57 @@ test_pattern(const test_spec * spec)
if ((nbefore - ndeleted) != nafter)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+}
+
+static void
+test_patterns(void)
+{
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ {
+ radix_tree *tree;
+ MemoryContext radixtree_ctx;
+ dsa_area *area;
+ const test_spec *spec = &test_specs[i];
- MemoryContextDelete(radixtree_ctx);
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+ /* Test the local radix tree */
+ tree = rt_create(radixtree_ctx, NULL);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextReset(radixtree_ctx);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(radixtree_ctx, area);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ dsa_detach(area);
+ MemoryContextDelete(radixtree_ctx);
+ }
}
Datum
test_radixtree(PG_FUNCTION_ARGS)
{
- test_empty();
+ /* get a new lwlock tranche id for all tests for shared radix tree */
+ lwlock_tranche_id = LWLockNewTrancheId();
- for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
- {
- test_basic(rt_node_kind_fanouts[i], false);
- test_basic(rt_node_kind_fanouts[i], true);
- }
-
- for (int shift = 0; shift <= (64 - 8); shift += 8)
- test_node_types(shift);
+ test_empty();
+ test_basic();
- /* Test different test patterns, with lots of entries */
- for (int i = 0; i < lengthof(test_specs); i++)
- test_pattern(&test_specs[i]);
+ test_node_types();
+ test_patterns();
PG_RETURN_VOID();
}
--
2.31.1
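
As a reading aid for 0007, the create/attach workflow described in the new radixtree.c header comment could look roughly like the following. This is only a sketch against the patch's API, with an arbitrary key and value; per the XXX above there is no locking yet, so truly concurrent access is not safe at this stage.

/* Leader */
dsa_area   *area = dsa_create(LWLockNewTrancheId());
radix_tree *rt = rt_create(CurrentMemoryContext, area);
rt_handle   handle = rt_get_handle(rt);     /* pass this plus the DSA handle to workers */

rt_set(rt, UINT64CONST(42), UINT64CONST(100));

/* Worker, after attaching to the same dsa_area */
radix_tree *worker_rt = rt_attach(area, handle);
uint64      value;

if (rt_search(worker_rt, UINT64CONST(42), &value))
    Assert(value == 100);

rt_detach(worker_rt);

/* Leader, once all workers have detached */
rt_free(rt);
dsa_detach(area);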
Attachment: v14-0004-Use-bitmapword-for-node-125.patch (application/octet-stream)
From 066eada2c94025a273fa0e49763c6817fcc1906a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 15:22:26 +0700
Subject: [PATCH v14 4/9] Use bitmapword for node-125
TODO: Rename macros copied from bitmapset.c
---
src/backend/lib/radixtree.c | 70 ++++++++++++++++++-------------------
1 file changed, 34 insertions(+), 36 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index e7f61fd943..abd0450727 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -62,6 +62,7 @@
#include "lib/radixtree.h"
#include "lib/stringinfo.h"
#include "miscadmin.h"
+#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
#include "utils/memutils.h"
@@ -103,6 +104,10 @@
#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+/* FIXME rename */
+#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
+
/* Enum used rt_node_search() */
typedef enum
{
@@ -207,6 +212,9 @@ typedef struct rt_node_base125
/* The index of slots for each fanout */
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[WORDNUM(128)];
} rt_node_base_125;
typedef struct rt_node_base256
@@ -271,9 +279,6 @@ typedef struct rt_node_leaf_125
{
rt_node_base_125 base;
- /* isset is a bitmap to track which slot is in use */
- uint8 isset[RT_NODE_NSLOTS_BITS(128)];
-
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_125;
@@ -655,13 +660,14 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
}
+#ifdef USE_ASSERT_CHECKING
/* Is the slot in the node used? */
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return (node->children[slot] != NULL);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
static inline bool
@@ -669,8 +675,9 @@ node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
Assert(NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
+#endif
static inline rt_node *
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
@@ -690,7 +697,10 @@ node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
static void
node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
+ int slotpos = node->base.slot_idxs[chunk];
+
Assert(!NODE_IS_LEAF(node));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->children[node->base.slot_idxs[chunk]] = NULL;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -701,44 +711,35 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
int slotpos = node->base.slot_idxs[chunk];
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
/* Return an unused slot in node-125 */
static int
-node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
-{
- int slotpos = 0;
-
- Assert(!NODE_IS_LEAF(node));
- while (node_inner_125_is_slot_used(node, slotpos))
- slotpos++;
-
- return slotpos;
-}
-
-static int
-node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+node_125_find_unused_slot(bitmapword *isset)
{
int slotpos;
+ int idx;
+ bitmapword inverse;
- Assert(NODE_IS_LEAF(node));
-
- /* We iterate over the isset bitmap per byte then check each bit */
- for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < WORDNUM(128); idx++)
{
- if (node->isset[slotpos] < 0xFF)
+ if (isset[idx] < ~((bitmapword) 0))
break;
}
- Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
- slotpos *= BITS_PER_BYTE;
- while (node_leaf_125_is_slot_used(node, slotpos))
- slotpos++;
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+
+ /* mark the slot used */
+ isset[idx] |= bmw_rightmost_one(inverse);
return slotpos;
-}
+ }
static inline void
node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
@@ -747,8 +748,7 @@ node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
Assert(!NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_inner_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
@@ -763,12 +763,10 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
Assert(NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
node->values[slotpos] = value;
}
@@ -2395,9 +2393,9 @@ rt_dump_node(rt_node *node, int level, bool recurse)
rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < 16; i++)
+ for (int i = 0; i < WORDNUM(128); i++)
{
- fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ fprintf(stderr, UINT64_FORMAT_HEX " ", n->base.isset[i]);
}
fprintf(stderr, "\n");
}
--
2.31.1
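
The reworked slot search in 0004 boils down to a common bitmap idiom: scan for the first word that is not all ones, invert it, and the rightmost one bit of the inverted word marks the first free slot. A standalone illustration of that idiom, using plain uint64 words and the pg_rightmost_one64() helper added by 0002 below (a sketch, not the patch code):

/* Find and claim the lowest clear bit in a small bitmap; returns -1 if full. */
static int
find_and_claim_slot(uint64 *isset, int nwords)
{
    for (int idx = 0; idx < nwords; idx++)
    {
        if (isset[idx] != ~UINT64CONST(0))
        {
            uint64      inverse = ~isset[idx];
            int         slot = idx * 64 + pg_rightmost_one_pos64(inverse);

            /* the rightmost one of the inverse is the lowest zero of isset[idx] */
            isset[idx] |= pg_rightmost_one64(inverse);
            return slot;
        }
    }
    return -1;
}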
Attachment: v14-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch (application/octet-stream)
From caf11ea2ca608edac00443b6ab7590688385b0d4 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v14 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..4384ff591d 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..fdc504596b 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 814e0b2dba..f95b6afd86 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 60c71d05fe..8305f09f2c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3654,7 +3654,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
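
A quick worked example of the two's-complement identity that the moved comment describes and that pg_rightmost_one32/64 implement (values chosen arbitrarily):

uint32 x = 0x68;                    /* 0b01101000 */
uint32 neg = (uint32) (-(int32) x); /* 0xFFFFFF98: bits inverted, plus one */
uint32 lowest = x & neg;            /* 0x08: only the rightmost one bit survives */

Assert(lowest == pg_rightmost_one32(x));
Assert(pg_rightmost_one32(0) == 0); /* zero input stays zero */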
Attachment: v14-0001-introduce-vector8_min-and-vector8_highbit_mask.patch (application/octet-stream)
From ceaf56be51d2c686a795e1ab1ab40f701ed21d62 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v14 1/9] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
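
To show the intended use of 0001's vector8_highbit_mask() in later node-search code, here is a hedged sketch of the usual "find the matching byte" pattern (the chunks array and searched value are hypothetical, and vector8_eq() is only available in SIMD builds, so the real code needs a scalar fallback as well):

uint8       chunks[16] = {0};   /* e.g., a node's chunk array (hypothetical) */
uint8       searched = 0x2a;
Vector8     haystack;
Vector8     cmp;
uint32      bitfield;

vector8_load(&haystack, chunks);
cmp = vector8_eq(haystack, vector8_broadcast(searched));

/* one result bit per byte: bit i is set iff chunks[i] == searched */
bitfield = vector8_highbit_mask(cmp);

if (bitfield)
{
    int     index = pg_rightmost_one_pos32(bitfield);   /* first matching position */

    /* use index ... */
}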
Attachment: v14-0003-Add-radix-implementation.patch (application/octet-stream)
From 6ba6c9979b2bd4fb5ef3c61d7a6edac1737e8509 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v14 3/9] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2541 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 581 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3291 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..e7f61fd943
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2541 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes, with
+ * shift > 0, store pointers to their child nodes as values, whereas leaf nodes,
+ * with shift == 0, store the 64-bit unsigned integer specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is also the reason this
+ * code currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants, one for inner nodes and
+ * one for leaf nodes, so there is some code duplication. While this sometimes
+ * makes code maintenance tricky, it reduces branch prediction misses when
+ * judging whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for each kind of radix tree node.
+ *
+ * rt_iterate_next() returns key-value pairs in ascending order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
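+
+/*
+ * A minimal usage sketch of the interface above (illustrative only):
+ *
+ *   radix_tree *tree = rt_create(CurrentMemoryContext);
+ *   uint64      key = 42;
+ *   uint64      value;
+ *
+ *   rt_set(tree, key, key * 10);
+ *   if (rt_search(tree, key, &value))
+ *       Assert(value == 420);
+ *   rt_free(tree);
+ */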
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes needed for a bitmap covering nslots slots; used
+ * to size the is-set bitmaps of nodes indexed by array lookup.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* The maximum number of tree levels the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
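+
+/*
+ * For example (illustrative): with RT_NODE_SPAN = 8, the key
+ * 0x0102030405060708 decomposes into the chunks 0x01, 0x02, ..., 0x08 as the
+ * shift goes from 56 down to 0, one chunk per tree level.
+ */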
+
+/*
+ * Mapping from the value to the bit in is-set bitmap in the node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
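+
+/* For example (illustrative): slot 10 maps to byte 1, bit (1 << 2) of the bitmap. */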
+
+/* Enum used by rt_node_search_inner() and rt_node_search_leaf() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds, and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind share the same
+ * node structure but have a different fanout, which is stored in 'fanout' of
+ * rt_node. For example, with size class 15, when a 16th element is to be
+ * inserted, we allocate a larger area and memcpy the entire old node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
+/* Common type for all node types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate up to 256
+ * children with a node span of 8 bits.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < rt_size_class_info[class].fanout)
+
+/* Base types of each node kind, for both leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+class for variable-sized node kinds */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base125
+{
+ rt_node n;
+
+ /* The index into the slot array for each possible chunk */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_125;
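+
+/*
+ * Lookup in a node-125 is thus a two-step indirection (sketch, matching
+ * node_leaf_125_get_value() below):
+ *
+ *   slot = node->base.slot_idxs[chunk];
+ *   if (slot != RT_NODE_125_INVALID_IDX)
+ *       value = node->values[slot];   (children[slot] in an inner node)
+ */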
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different from something fitting into a pointer
+ * width type
+ * 2) we need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_125
+{
+ rt_node_base_125 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_125;
+
+typedef struct rt_node_leaf_125
+{
+ rt_node_base_125 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_125;
+
+/*
+ * node-256 is the largest node type. This node has an array of length
+ * RT_NODE_MAX_SLOTS for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size class */
+typedef struct rt_size_class_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_size_class_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
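+
+/*
+ * For example (assuming SLAB_DEFAULT_BLOCK_SIZE is 8kB): a 40-byte chunk size
+ * gives NODE_SLAB_BLOCK_SIZE(40) = Max((8192 / 40) * 40, 40 * 32) = 8160, i.e.
+ * the largest multiple of the chunk size that still fits in the default block.
+ */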
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
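+
+/*
+ * A node therefore grows along the path (sketch):
+ *   kind 4 (fanout 4) -> kind 32 (fanout 15, then 32) -> kind 125 -> kind 256,
+ * allocating the next size class and copying the old node's contents each time.
+ */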
+
+/*
+ * Iteration support.
+ *
+ * Iterating over the radix tree returns each key-value pair in ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunk array of
+ * the given node.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
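+
+/*
+ * Worked example (illustrative): searching chunk 5 in chunks {1, 3, 5, 9} with
+ * count = 4: vector8_eq() yields 0xFF only in byte 2, so the combined bitfield
+ * is 0b0100 after masking with (1 << count) - 1, and pg_rightmost_one_pos32()
+ * returns index 2.
+ */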
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunk array of
+ * the given node.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
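+
+/*
+ * Worked example (illustrative): to find the insert position of chunk 5 in
+ * chunks {1, 3, 9, 11} with count = 4, vector8_min() equals the spread chunk
+ * only where the existing chunk is >= 5, so vector8_eq(spread, min) matches
+ * bytes 2 and 3; the rightmost set bit is at index 2, which is where 5 must be
+ * inserted to keep the array sorted.
+ */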
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(rt_node *) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+static void
+node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+/* Return an unused slot in node-125 */
+static int
+node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Delete the child at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
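+
+/*
+ * For example (illustrative): keys 0..0xFF need shift 0, keys up to 0xFFFF
+ * need shift 8, and so on; shift_get_max_val(8) accordingly returns 0xFFFF,
+ * the largest key a tree whose root has shift 8 can store.
+ */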
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ bool inner = shift > 0;
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = newnode;
+}
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ else
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+ * Technically it's 256, but we cannot store that in a uint8,
+ * and this is the max size class so it will never grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+static inline void
+rt_copy_node(rt_node *newnode, rt_node *oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node*
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+{
+ rt_node *newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
+ rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
+ rt_copy_node(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->base.n.shift = shift;
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_125_get_child(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * to the value is set to value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_125_get_value(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_inner_32 *new32;
+
+ new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_inner_32;
+ }
+ else
+ {
+ rt_node_inner_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+retry_insert_inner_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_125_update(n125, chunk, child);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_inner_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_125_insert(n125, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_leaf_32 *new32;
+
+ new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
+ rt_node_leaf_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
+ key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+ retry_insert_leaf_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_125_update(n125, chunk, value);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_leaf_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_125_insert(n125, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to 'value'
+ * and return true. Return false if the entry doesn't yet exist.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, we set the value to *value_p, so it must
+ * not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is being
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true and set key_p and value_p if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from level 1 upward until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
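+
+/*
+ * For example (illustrative): while descending during iteration, visiting
+ * chunk 0x12 at shift 8 and then chunk 0x34 at shift 0 leaves iter->key with
+ * 0x1234 in its low 16 bits, so the full key is available when the leaf slot
+ * is reached.
+ */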
+
+/*
+ * Advance the slot in the inner node. Return the child if it exists, otherwise
+ * NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_125_get_child(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to value_p, otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_125_get_value(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ n125->slot_idxs[i]));
+ else
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ n125->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check if the number of used chunk matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_125_get_value(n125, i));
+ }
+ else
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_125_get_child(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 911a768a29..fd101e3bf4 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -22,6 +22,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..ea993e63df
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,581 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /* prepare keys in interleaved order like 1, 32, 2, 31, 3, 30, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
On Mon, Dec 19, 2022 at 4:13 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Dec 12, 2022 at 7:14 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Dec 9, 2022 at 8:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Dec 9, 2022 at 5:53 PM John Naylor <john.naylor@enterprisedb.com> wrote:
I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail).
The minimum requirement is 2MB. In PoC patch, TIDStore checks how big
the radix tree is using dsa_get_total_size(). If the size returned by
dsa_get_total_size() (+ some memory used by TIDStore meta information)
exceeds maintenance_work_mem, lazy vacuum starts to do index vacuum
and heap vacuum. However, when allocating DSA memory for
radix_tree_control at creation, we allocate 1MB
(DSA_INITIAL_SEGMENT_SIZE) DSM memory and use memory required for
radix_tree_control from it. dsa_get_total_size() returns 1MB even if
there is no TID collected.

2MB makes sense.
If the metadata is small, it seems counterproductive to count it towards the total. We want the decision to be driven by blocks allocated. I have an idea on that below.
Remember when we discussed how we might approach parallel pruning? I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing.
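For illustration, a minimal sketch of that local-buffer idea might look like the code below. The TidStore type and the tidstore_add_tids()/tidstore_memory_usage() helpers are assumed names for this sketch only, not the patch's actual API:

#define LOCAL_DEAD_TID_BUF	8192	/* roughly 48kB worth of TIDs */

static ItemPointerData local_dead_tids[LOCAL_DEAD_TID_BUF];
static int	num_local_dead_tids = 0;

/*
 * Buffer one dead TID locally.  Returns true when the caller should pause
 * the heap scan and run index/heap vacuuming because the shared store has
 * grown past 'limit' bytes.
 */
static bool
remember_dead_tid(TidStore *store, ItemPointer tid, Size limit)
{
	local_dead_tids[num_local_dead_tids++] = *tid;

	/* flush before the next heap page could overflow the local array */
	if (num_local_dead_tids + MaxHeapTuplesPerPage > LOCAL_DEAD_TID_BUF)
	{
		tidstore_add_tids(store, local_dead_tids, num_local_dead_tids);
		num_local_dead_tids = 0;

		return tidstore_memory_usage(store) > limit;
	}

	return false;
}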
Right, I think it's no problem in slab cases. In DSA cases, the new
segment size follows a geometric series that approximately doubles the
total storage each time we create a new segment. This behavior comes
from the fact that the underlying DSM system isn't designed for large
numbers of segments.

And taking a look, the size of a new segment can get quite large. It seems we could test if the total DSA area allocated is greater than half of maintenance_work_mem. If that parameter is a power of two (common) and >=8MB, then the area will contain just under a power of two the last time it passes the test. The next segment will bring it to about 3/4 full, like this:
maintenance work mem = 256MB, so stop if we go over 128MB:
2*(1+2+4+8+16+32) = 126MB -> keep going
126MB + 64MB = 190MB -> stop

That would be a simple way to be conservative with the memory limit. The unfortunate aspect is that the last segment would be mostly wasted, but it's paradise compared to the pessimistically-sized single array we have now (even with Peter G.'s VM snapshot informing the allocation size, I imagine).
Right. In this case, even if we allocate 64MB, we will use only 2088
bytes at maximum. So I think the memory space used for vacuum is
practically limited to half.

And as for the minimum possible maintenance_work_mem, I think this would work with 2MB, if the community is okay with technically going over the limit by a few bytes of overhead if a buildfarm animal is set to that value. I imagine it would never go over the limit for realistic (and even most unrealistic) values. Even with a VM snapshot page in memory and small local arrays of TIDs, I think with this scheme we'll be well under the limit.
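To illustrate, that check could be as simple as the sketch below (illustrative only; dead_tid_store_is_full() is an assumed name, and maintenance_work_mem is in kilobytes):

/*
 * Stop collecting dead TIDs once the DSA-backed TID store has allocated
 * more than half of maintenance_work_mem, per the reasoning above.
 */
static bool
dead_tid_store_is_full(dsa_area *area)
{
	Size	limit = (Size) maintenance_work_mem * 1024 / 2;

	return dsa_get_total_size(area) > limit;
}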
Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
seems that they look at only memory that are actually dsa_allocate'd.
To be exact, we estimate the number of hash buckets based on work_mem
(and hash_mem_multiplier) and use it as the upper limit. So I've
confirmed that the result of dsa_get_total_size() could exceed the
limit. I'm not sure it's a known and legitimate usage. If we can
follow such usage, we can probably track how much dsa_allocate'd
memory is used in the radix tree.

I've experimented with this idea. The newly added 0008 patch changes
the radix tree so that it counts the memory usage for both local and
shared cases.
I've attached updated version patches to make cfbot happy.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v15-0008-PoC-calculate-memory-usage-in-radix-tree.patch (application/octet-stream)
From 8ec7c3f15da739c1a8d78c1eec1e1f45cbe8ba21 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 19 Dec 2022 14:41:43 +0900
Subject: [PATCH v15 8/9] PoC: calculate memory usage in radix tree.
---
src/backend/lib/radixtree.c | 137 +++++++++++++++++++++++------------
src/backend/utils/mmgr/dsa.c | 42 +++++++++++
src/include/utils/dsa.h | 1 +
3 files changed, 135 insertions(+), 45 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 455071cbab..4ad55a0b7c 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -360,14 +360,24 @@ typedef struct rt_size_class_elem
const char *name;
int fanout;
- /* slab chunk size */
+ /* node size */
Size inner_size;
Size leaf_size;
/* slab block size */
- Size inner_blocksize;
- Size leaf_blocksize;
+ Size slab_inner_blocksize;
+ Size slab_leaf_blocksize;
+
+ /*
+ * For a local radix tree we can get how much memory is allocated for a
+ * node using GetMemoryChunkSpace(). However, DSA has no such facility,
+ * so for the shared case we precompute the sizes that DSA actually
+ * allocates for each node class and use them for the memory calculation.
+ */
+ Size dsa_inner_size;
+ Size dsa_leaf_size;
} rt_size_class_elem;
+static bool rt_size_class_dsa_info_initialized = false;
/*
* Calculate the slab blocksize so that we can allocate at least 32 chunks
@@ -381,40 +391,40 @@ static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
.fanout = 4,
.inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
.leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
.fanout = 256,
.inner_size = sizeof(rt_node_inner_256),
.leaf_size = sizeof(rt_node_leaf_256),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ .slab_inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .slab_leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
},
};
@@ -477,6 +487,12 @@ typedef struct radix_tree_control
uint64 max_val;
uint64 num_keys;
+ /*
+ * Track the amount of memory used. The callers can ask for it
+ * with rt_memory_usage().
+ */
+ uint64 mem_used;
+
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
@@ -1005,15 +1021,22 @@ static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
rt_node_ptr newnode;
+ Size size;
if (RadixTreeIsShared(tree))
{
dsa_pointer dp;
if (inner)
+ {
dp = dsa_allocate(tree->area, rt_size_class_info[size_class].inner_size);
+ size = rt_size_class_info[size_class].dsa_inner_size;
+ }
else
+ {
dp = dsa_allocate(tree->area, rt_size_class_info[size_class].leaf_size);
+ size = rt_size_class_info[size_class].dsa_leaf_size;
+ }
newnode.encoded = (rt_pointer) dp;
newnode.decoded = rt_pointer_decode(tree, newnode.encoded);
@@ -1028,8 +1051,12 @@ rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
rt_size_class_info[size_class].leaf_size);
newnode.encoded = rt_pointer_encode(newnode.decoded);
+ size = GetMemoryChunkSpace(newnode.decoded);
}
+ /* update memory usage */
+ tree->ctl->mem_used += size;
+
#ifdef RT_DEBUG
/* update the statistics */
tree->ctl->cnt[size_class]++;
@@ -1095,6 +1122,15 @@ rt_grow_node_kind(radix_tree *tree, rt_node_ptr node, uint8 new_kind)
static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
+ int size;
+ static const int fanout_node_class[RT_NODE_MAX_SLOTS] =
+ {
+ [4] = RT_CLASS_4_FULL,
+ [15] = RT_CLASS_32_PARTIAL,
+ [32] = RT_CLASS_32_FULL,
+ [125] = RT_CLASS_125_FULL,
+ };
+
/* If we're deleting the root node, make the tree empty */
if (tree->ctl->root == node.encoded)
{
@@ -1104,28 +1140,38 @@ rt_free_node(radix_tree *tree, rt_node_ptr node)
#ifdef RT_DEBUG
{
- int i;
+ int size_class = (NODE_FANOUT(node) == 0)
+ ? RT_CLASS_256
+ : fanout_node_class[NODE_FANOUT(node)];
/* update the statistics */
- for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- if (NODE_FANOUT(node) == rt_size_class_info[i].fanout)
- break;
- }
-
- /* fanout of node256 is intentionally 0 */
- if (i == RT_SIZE_CLASS_COUNT)
- i = RT_CLASS_256;
-
- tree->ctl->cnt[i]--;
- Assert(tree->ctl->cnt[i] >= 0);
+ tree->ctl->cnt[size_class]--;
+ Assert(tree->ctl->cnt[size_class] >= 0);
}
#endif
if (RadixTreeIsShared(tree))
+ {
+ int size_class = (NODE_FANOUT(node) == 0)
+ ? RT_CLASS_256
+ : fanout_node_class[NODE_FANOUT(node)];
+
+ if (!NODE_IS_LEAF(node))
+ size = rt_size_class_info[size_class].dsa_inner_size;
+ else
+ size = rt_size_class_info[size_class].dsa_leaf_size;
+
dsa_free(tree->area, (dsa_pointer) node.encoded);
+ }
else
+ {
+ size = GetMemoryChunkSpace(node.decoded);
pfree(node.decoded);
+ }
+
+ /* update memory usage */
+ tree->ctl->mem_used -= size;
+ Assert(tree->ctl->mem_used > 0);
}
/*
@@ -1837,15 +1883,18 @@ rt_create(MemoryContext ctx, dsa_area *area)
dp = dsa_allocate0(area, sizeof(radix_tree_control));
tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
tree->ctl->handle = (rt_handle) dp;
+ tree->ctl->mem_used += dsa_get_size_class(sizeof(radix_tree_control));
}
else
{
tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
tree->ctl->handle = InvalidDsaPointer;
+ tree->ctl->mem_used += GetMemoryChunkSpace(tree->ctl);
}
tree->ctl->magic = RADIXTREE_MAGIC;
tree->ctl->root = InvalidRTPointer;
+ tree->ctl->mem_used += GetMemoryChunkSpace(tree);
/* Create the slab allocator for each size class */
if (area == NULL)
@@ -1854,17 +1903,29 @@ rt_create(MemoryContext ctx, dsa_area *area)
{
tree->inner_slabs[i] = SlabContextCreate(ctx,
rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].slab_inner_blocksize,
rt_size_class_info[i].inner_size);
tree->leaf_slabs[i] = SlabContextCreate(ctx,
rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].slab_leaf_blocksize,
rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
tree->ctl->cnt[i] = 0;
#endif
}
}
+ else if (!rt_size_class_dsa_info_initialized)
+ {
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ rt_size_class_info[i].dsa_inner_size =
+ dsa_get_size_class(rt_size_class_info[i].inner_size);
+ rt_size_class_info[i].dsa_leaf_size =
+ dsa_get_size_class(rt_size_class_info[i].leaf_size);
+ }
+
+ rt_size_class_dsa_info_initialized = true;
+ }
MemoryContextSwitchTo(old_ctx);
@@ -2534,22 +2595,8 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
-
Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
-
- if (RadixTreeIsShared(tree))
- total = dsa_get_total_size(tree->area);
- else
- {
- for (int i = 0; i < RT_NODE_KIND_COUNT; i++)
- {
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
- }
- }
-
- return total;
+ return tree->ctl->mem_used;
}
/*
@@ -2873,9 +2920,9 @@ rt_dump(radix_tree *tree)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
rt_size_class_info[i].name,
rt_size_class_info[i].inner_size,
- rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].slab_inner_blocksize,
rt_size_class_info[i].leaf_size,
- rt_size_class_info[i].leaf_blocksize);
+ rt_size_class_info[i].slab_leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
if (!RTPointerIsValid(tree->ctl->root))
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index ad169882af..e77aea10e2 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1208,6 +1208,48 @@ dsa_minimum_size(void)
return pages * FPM_PAGE_SIZE;
}
+size_t
+dsa_get_size_class(size_t size)
+{
+ uint16 size_class;
+
+ if (size > dsa_size_classes[lengthof(dsa_size_classes) - 1])
+ return size;
+ else if (size < lengthof(dsa_size_class_map) * DSA_SIZE_CLASS_MAP_QUANTUM)
+ {
+ int mapidx;
+
+ /* For smaller sizes we have a lookup table... */
+ mapidx = ((size + DSA_SIZE_CLASS_MAP_QUANTUM - 1) /
+ DSA_SIZE_CLASS_MAP_QUANTUM) - 1;
+ size_class = dsa_size_class_map[mapidx];
+ }
+ else
+ {
+ uint16 min;
+ uint16 max;
+
+ /* ... and for the rest we search by binary chop. */
+ min = dsa_size_class_map[lengthof(dsa_size_class_map) - 1];
+ max = lengthof(dsa_size_classes) - 1;
+
+ while (min < max)
+ {
+ uint16 mid = (min + max) / 2;
+ uint16 class_size = dsa_size_classes[mid];
+
+ if (class_size < size)
+ min = mid + 1;
+ else
+ max = mid;
+ }
+
+ size_class = min;
+ }
+
+ return dsa_size_classes[size_class];
+}
+
/*
* Workhorse function for dsa_create and dsa_create_in_place.
*/
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index dad06adecc..a17c4eb88c 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -118,6 +118,7 @@ extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags)
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
extern size_t dsa_get_total_size(dsa_area *area);
+extern size_t dsa_get_size_class(size_t size);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
--
2.31.1
v15-0005-tool-for-measuring-radix-tree-performance.patch (application/octet-stream)
From 75af1182c7107486db3846e616625e456d640e3c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v15 5/9] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 +++
contrib/bench_radix_tree/bench_radix_tree.c | 635 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 767 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..83529805fc
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..a0693695e6
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,635 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates shuffle implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* for reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
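+/*
+ * (A sketch of how the default filter below is assumed to achieve that: the
+ * mask restricts each byte of the hashed key separately, so a byte masked
+ * with 0x00 or 0x07 allows only a few distinct chunk values at that level,
+ * favoring the small node kinds, while 0x7F and 0xFF allow up to 128 and 256
+ * chunks, favoring the larger ones.)
+ */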
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t < 10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.31.1
Attachment: v15-0007-PoC-DSA-support-for-radix-tree.patch (application/octet-stream)
From d575b8f8215494d9ac82b256b260acd921de1928 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 16:42:55 +0700
Subject: [PATCH v15 7/9] PoC: DSA support for radix tree
---
.../bench_radix_tree--1.0.sql | 2 +
contrib/bench_radix_tree/bench_radix_tree.c | 16 +-
src/backend/lib/radixtree.c | 437 ++++++++++++++----
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 8 +-
src/include/utils/dsa.h | 1 +
.../expected/test_radixtree.out | 25 +
.../modules/test_radixtree/test_radixtree.c | 147 ++++--
8 files changed, 502 insertions(+), 146 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 83529805fc..d9216d715c 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -7,6 +7,7 @@ create function bench_shuffle_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
@@ -23,6 +24,7 @@ create function bench_seq_search(
minblk int4,
maxblk int4,
random_block bool DEFAULT false,
+shared bool DEFAULT false,
OUT nkeys int8,
OUT rt_mem_allocated int8,
OUT array_mem_allocated int8,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index a0693695e6..1a26722495 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -154,6 +154,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
BlockNumber maxblk = PG_GETARG_INT32(1);
bool random_block = PG_GETARG_BOOL(2);
radix_tree *rt = NULL;
+ bool shared = PG_GETARG_BOOL(3);
+ dsa_area *dsa = NULL;
uint64 ntids;
uint64 key;
uint64 last_key = PG_UINT64_MAX;
@@ -176,7 +178,11 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
/* measure the load time of the radix tree */
- rt = rt_create(CurrentMemoryContext);
+ if (shared)
+ dsa = dsa_create(LWLockNewTrancheId());
+ rt = rt_create(CurrentMemoryContext, dsa);
+
+ /* measure the load time of the radix tree */
start_time = GetCurrentTimestamp();
for (int i = 0; i < ntids; i++)
{
@@ -327,7 +333,7 @@ bench_load_random_int(PG_FUNCTION_ARGS)
elog(ERROR, "return type must be a row type");
pg_prng_seed(&state, 0);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
@@ -393,7 +399,7 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
}
elog(NOTICE, "bench with filter 0x%lX", filter);
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
for (uint64 i = 0; i < cnt; i++)
{
@@ -462,7 +468,7 @@ bench_fixed_height_search(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
start_time = GetCurrentTimestamp();
@@ -574,7 +580,7 @@ bench_node128_load(PG_FUNCTION_ARGS)
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
elog(ERROR, "return type must be a row type");
- rt = rt_create(CurrentMemoryContext);
+ rt = rt_create(CurrentMemoryContext, NULL);
key_id = 0;
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bff37a2c35..455071cbab 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -22,6 +22,15 @@
* choose it to avoid an additional pointer traversal. It is the reason this code
* currently does not support variable-length keys.
*
+ * If a DSA area is specified for rt_create(), the radix tree is created in the
+ * DSA area so that multiple processes can access it simultaneously. The process
+ * that created the shared radix tree needs to pass both the DSA area specified
+ * when calling rt_create() and the dsa_pointer of the radix tree, fetched by
+ * rt_get_handle(), to other processes so that they can attach with rt_attach().
+ *
+ * XXX: the shared radix tree is still in a PoC state as it doesn't have any
+ * locking support. Also, only one process at a time can iterate over it.
+ *
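+ * A minimal usage sketch (an illustration assuming no concurrent access; see
+ * the locking XXX above):
+ *
+ *   creator:  rt = rt_create(ctx, area);
+ *             handle = rt_get_handle(rt);    <- pass this to other backends
+ *   backend:  rt = rt_attach(area, handle);
+ *             rt_set(rt, key, value); rt_search(rt, key, &value);
+ *             rt_detach(rt);
+ *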
* XXX: Most functions in this file have two variants for inner nodes and leaf
* nodes, therefore there are duplication codes. While this sometimes makes the
* code maintenance tricky, this reduces branch prediction misses when judging
@@ -34,6 +43,9 @@
*
* rt_create - Create a new, empty radix tree
* rt_free - Free the radix tree
+ * rt_attach - Attach to the radix tree
+ * rt_detach - Detach from the radix tree
+ * rt_get_handle - Return the handle of the radix tree
* rt_search - Search a key-value pair
* rt_set - Set a key-value pair
* rt_delete - Delete a key-value pair
@@ -65,6 +77,7 @@
#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
#ifdef RT_DEBUG
@@ -426,6 +439,10 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: We need either a safeguard to prevent other processes from beginning
+ * an iteration while one process is iterating, or support for multiple
+ * processes iterating concurrently.
*/
typedef struct rt_node_iter
{
@@ -445,23 +462,43 @@ struct rt_iter
uint64 key;
};
-/* A radix tree with nodes */
-struct radix_tree
+/* A magic value used to identify our radix tree */
+#define RADIXTREE_MAGIC 0x54A48167
+
+/* Control information for an radix tree */
+typedef struct radix_tree_control
{
- MemoryContext context;
+ rt_handle handle;
+ uint32 magic;
+ /* Root node */
rt_pointer root;
+
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
- MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
-
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
+} radix_tree_control;
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ /* control object in either backend-local memory or DSA */
+ radix_tree_control *ctl;
+
+ /* used only when the radix tree is shared */
+ dsa_area *area;
+
+ /* used only when the radix tree is private */
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
};
+#define RadixTreeIsShared(rt) ((rt)->area != NULL)
static void rt_new_root(radix_tree *tree, uint64 key);
@@ -490,9 +527,12 @@ static void rt_verify_node(rt_node_ptr node);
/* Decode and encode functions of rt_pointer */
static inline rt_node *
-rt_pointer_decode(rt_pointer encoded)
+rt_pointer_decode(radix_tree *tree, rt_pointer encoded)
{
- return (rt_node *) encoded;
+ if (RadixTreeIsShared(tree))
+ return (rt_node *) dsa_get_address(tree->area, encoded);
+ else
+ return (rt_node *) encoded;
}
static inline rt_pointer
@@ -503,11 +543,11 @@ rt_pointer_encode(rt_node *decoded)
/* Return a rt_node_ptr created from the given encoded pointer */
static inline rt_node_ptr
-rt_node_ptr_encoded(rt_pointer encoded)
+rt_node_ptr_encoded(radix_tree *tree, rt_pointer encoded)
{
return (rt_node_ptr) {
.encoded = encoded,
- .decoded = rt_pointer_decode(encoded),
+ .decoded = rt_pointer_decode(tree, encoded)
};
}
@@ -954,8 +994,8 @@ rt_new_root(radix_tree *tree, uint64 key)
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
NODE_SHIFT(newnode) = shift;
- tree->max_val = shift_get_max_val(shift);
- tree->root = newnode.encoded;
+ tree->ctl->max_val = shift_get_max_val(shift);
+ tree->ctl->root = newnode.encoded;
}
/*
@@ -964,20 +1004,35 @@ rt_new_root(radix_tree *tree, uint64 key)
static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node_ptr newnode;
+ rt_node_ptr newnode;
- if (inner)
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
- else
- newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ if (RadixTreeIsShared(tree))
+ {
+ dsa_pointer dp;
- newnode.encoded = rt_pointer_encode(newnode.decoded);
+ if (inner)
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].inner_size);
+ else
+ dp = dsa_allocate(tree->area, rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = (rt_pointer) dp;
+ newnode.decoded = rt_pointer_decode(tree, newnode.encoded);
+ }
+ else
+ {
+ if (inner)
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
+ }
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[size_class]++;
+ tree->ctl->cnt[size_class]++;
#endif
return newnode;
@@ -1041,10 +1096,10 @@ static void
rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node.encoded)
+ if (tree->ctl->root == node.encoded)
{
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
+ tree->ctl->root = InvalidRTPointer;
+ tree->ctl->max_val = 0;
}
#ifdef RT_DEBUG
@@ -1062,12 +1117,15 @@ rt_free_node(radix_tree *tree, rt_node_ptr node)
if (i == RT_SIZE_CLASS_COUNT)
i = RT_CLASS_256;
- tree->cnt[i]--;
- Assert(tree->cnt[i] >= 0);
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
}
#endif
- pfree(node.decoded);
+ if (RadixTreeIsShared(tree))
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ else
+ pfree(node.decoded);
}
/*
@@ -1083,7 +1141,7 @@ rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child.encoded;
+ tree->ctl->root = new_child.encoded;
}
else
{
@@ -1105,7 +1163,7 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
@@ -1123,15 +1181,15 @@ rt_extend(radix_tree *tree, uint64 key)
n4->base.n.shift = shift;
n4->base.n.count = 1;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
root->chunk = 0;
- tree->root = node.encoded;
+ tree->ctl->root = node.encoded;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ tree->ctl->max_val = shift_get_max_val(target_shift);
}
/*
@@ -1163,7 +1221,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
}
rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
+ tree->ctl->num_keys++;
}
/*
@@ -1174,12 +1232,11 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
- rt_pointer *child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action, rt_pointer *child_p)
{
uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_pointer child;
+ rt_pointer child = InvalidRTPointer;
switch (NODE_KIND(node))
{
@@ -1210,6 +1267,7 @@ rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
break;
found = true;
+
if (action == RT_ACTION_FIND)
child = n32->children[idx];
else /* RT_ACTION_DELETE */
@@ -1761,33 +1819,51 @@ retry_insert_leaf_32:
* Create the radix tree in the given memory context and return it.
*/
radix_tree *
-rt_create(MemoryContext ctx)
+rt_create(MemoryContext ctx, dsa_area *area)
{
radix_tree *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
tree->context = ctx;
- tree->root = InvalidRTPointer;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+
+ tree->area = area;
+ dp = dsa_allocate0(area, sizeof(radix_tree_control));
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, dp);
+ tree->ctl->handle = (rt_handle) dp;
+ }
+ else
+ {
+ tree->ctl = (radix_tree_control *) palloc0(sizeof(radix_tree_control));
+ tree->ctl->handle = InvalidDsaPointer;
+ }
+
+ tree->ctl->magic = RADIXTREE_MAGIC;
+ tree->ctl->root = InvalidRTPointer;
/* Create the slab allocator for each size class */
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ if (area == NULL)
{
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
- rt_size_class_info[i].leaf_size);
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
#ifdef RT_DEBUG
- tree->cnt[i] = 0;
+ tree->ctl->cnt[i] = 0;
#endif
+ }
}
MemoryContextSwitchTo(old_ctx);
@@ -1795,16 +1871,163 @@ rt_create(MemoryContext ctx)
return tree;
}
+/*
+ * Get a handle that can be used by other processes to attach to this radix
+ * tree.
+ */
+dsa_pointer
+rt_get_handle(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree->ctl->handle;
+}
+
+/*
+ * Attach to an existing radix tree using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+radix_tree *
+rt_attach(dsa_area *area, rt_handle handle)
+{
+ radix_tree *tree;
+ dsa_pointer control;
+
+ /* Allocate the backend-local object representing the radix tree */
+ tree = (radix_tree *) palloc0(sizeof(radix_tree));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the local radix tree */
+ tree->area = area;
+ tree->ctl = (radix_tree_control *) dsa_get_address(area, control);
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ return tree;
+}
+
+/*
+ * Detach from a radix tree. This frees backend-local resources associated
+ * with the radix tree, but the radix tree will continue to exist until
+ * it is explicitly freed.
+ */
+void
+rt_detach(radix_tree *tree)
+{
+ Assert(RadixTreeIsShared(tree));
+ Assert(tree->ctl->magic == RADIXTREE_MAGIC);
+
+ pfree(tree);
+}
+
+/*
+ * Recursively free all nodes allocated in the DSA area.
+ */
+static void
+rt_free_recurse(radix_tree *tree, rt_pointer ptr)
+{
+ rt_node_ptr node = rt_node_ptr_encoded(tree, ptr);
+
+ Assert(RadixTreeIsShared(tree));
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers, so free it */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->area, (dsa_pointer) node.encoded);
+ return;
+ }
+
+ switch (NODE_KIND(node))
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < NODE_COUNT(node); i++)
+ rt_free_recurse(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_125_get_child(n125, i));
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
+
+ /* Free all children recursively */
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ rt_free_recurse(tree, node_inner_256_get_child(n256, i));
+ }
+ break;
+ }
+ }
+
+ /* Free the inner node itself */
+ dsa_free(tree->area, node.encoded);
+}
+
/*
* Free the given radix tree.
*/
void
rt_free(radix_tree *tree)
{
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
{
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
+ /* Free all memory used for radix tree nodes */
+ if (RTPointerIsValid(tree->ctl->root))
+ rt_free_recurse(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->area, tree->ctl->handle);
+ }
+ else
+ {
+ /* Free all memory used for radix tree nodes */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+ pfree(tree->ctl);
}
pfree(tree);
@@ -1822,16 +2045,18 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
rt_node_ptr node;
rt_node_ptr parent;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree, create the root */
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
rt_extend(tree, key);
/* Descend the tree until a leaf node */
- node = parent = rt_node_ptr_encoded(tree->root);
+ node = parent = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
@@ -1847,7 +2072,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1855,7 +2080,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ tree->ctl->num_keys++;
return updated;
}
@@ -1871,12 +2096,13 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
rt_node_ptr node;
int shift;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
Assert(value_p != NULL);
- if (!RTPointerIsValid(tree->root) || key > tree->max_val)
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
@@ -1890,7 +2116,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1910,14 +2136,16 @@ rt_delete(radix_tree *tree, uint64 key)
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (!RTPointerIsValid(tree->ctl->root) || key > tree->ctl->max_val)
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
@@ -1930,7 +2158,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1945,7 +2173,7 @@ rt_delete(radix_tree *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ tree->ctl->num_keys--;
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1985,16 +2213,18 @@ rt_begin_iterate(radix_tree *tree)
rt_iter *iter;
int top_level;
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
old_ctx = MemoryContextSwitchTo(tree->context);
iter = (rt_iter *) palloc0(sizeof(rt_iter));
iter->tree = tree;
/* empty tree */
- if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->ctl->root))
return iter;
- root = rt_node_ptr_encoded(iter->tree->root);
+ root = rt_node_ptr_encoded(tree, iter->tree->ctl->root);
top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
@@ -2045,8 +2275,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
bool
rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
{
+ Assert(!RadixTreeIsShared(iter->tree) || iter->tree->ctl->magic == RADIXTREE_MAGIC);
+
/* Empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return false;
for (;;)
@@ -2190,7 +2422,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *
if (found)
{
rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
- *child_p = rt_node_ptr_encoded(child);
+ *child_p = rt_node_ptr_encoded(iter->tree, child);
}
return found;
@@ -2293,7 +2525,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_
uint64
rt_num_entries(radix_tree *tree)
{
- return tree->num_keys;
+ return tree->ctl->num_keys;
}
/*
@@ -2302,12 +2534,19 @@ rt_num_entries(radix_tree *tree)
uint64
rt_memory_usage(radix_tree *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(radix_tree) + sizeof(radix_tree_control);
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ Assert(!RadixTreeIsShared(tree) || tree->ctl->magic == RADIXTREE_MAGIC);
+
+ if (RadixTreeIsShared(tree))
+ total = dsa_get_total_size(tree->area);
+ else
{
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
}
return total;
@@ -2391,23 +2630,23 @@ rt_verify_node(rt_node_ptr node)
void
rt_stats(radix_tree *tree)
{
- rt_node *root = rt_pointer_decode(tree->root);
+ rt_node *root = rt_pointer_decode(tree, tree->ctl->root);
if (root == NULL)
return;
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
+ tree->ctl->num_keys,
root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node_ptr node, int level, bool recurse)
+rt_dump_node(radix_tree *tree, rt_node_ptr node, int level, bool recurse)
{
rt_node *n = node.decoded;
char space[128] = {0};
@@ -2445,7 +2684,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n4->children[i]),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2473,7 +2712,7 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ rt_dump_node(tree, rt_node_ptr_encoded(tree, n32->children[i]),
level + 1, recurse);
}
else
@@ -2526,7 +2765,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2559,7 +2800,9 @@ rt_dump_node(rt_node_ptr node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ rt_dump_node(tree,
+ rt_node_ptr_encoded(tree,
+ node_inner_256_get_child(n256, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2579,28 +2822,28 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
- tree->max_val, tree->max_val);
+ tree->ctl->max_val, tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
{
elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
- node = rt_node_ptr_encoded(tree->root);
+ node = rt_node_ptr_encoded(tree, tree->ctl->root);
shift = NODE_SHIFT(node);
while (shift >= 0)
{
rt_pointer child;
- rt_dump_node(node, level, false);
+ rt_dump_node(tree, node, level, false);
if (NODE_IS_LEAF(node))
{
@@ -2615,7 +2858,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = rt_node_ptr_encoded(child);
+ node = rt_node_ptr_encoded(tree, child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2633,15 +2876,15 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].inner_blocksize,
rt_size_class_info[i].leaf_size,
rt_size_class_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
- if (!RTPointerIsValid(tree->root))
+ if (!RTPointerIsValid(tree->ctl->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- root = rt_node_ptr_encoded(tree->root);
- rt_dump_node(root, 0, true);
+ root = rt_node_ptr_encoded(tree, tree->ctl->root);
+ rt_dump_node(tree, root, 0, true);
}
#endif
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 82376fde2d..ad169882af 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..68a11df970 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -14,18 +14,24 @@
#define RADIXTREE_H
#include "postgres.h"
+#include "utils/dsa.h"
#define RT_DEBUG 1
typedef struct radix_tree radix_tree;
typedef struct rt_iter rt_iter;
+typedef dsa_pointer rt_handle;
-extern radix_tree *rt_create(MemoryContext ctx);
+extern radix_tree *rt_create(MemoryContext ctx, dsa_area *dsa);
extern void rt_free(radix_tree *tree);
extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
extern rt_iter *rt_begin_iterate(radix_tree *tree);
+extern rt_handle rt_get_handle(radix_tree *tree);
+extern radix_tree *rt_attach(dsa_area *dsa, dsa_pointer dp);
+extern void rt_detach(radix_tree *tree);
+
extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
extern void rt_end_iterate(rt_iter *iter);
extern bool rt_delete(radix_tree *tree, uint64 key);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 405606fe2f..dad06adecc 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index ce645cb8b5..a217e0d312 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -6,28 +6,53 @@ CREATE EXTENSION test_radixtree;
SELECT test_radixtree();
NOTICE: testing basic operations with leaf node 4
NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 125
NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
NOTICE: testing basic operations with leaf node 256
NOTICE: testing basic operations with inner node 256
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "0"
NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "8"
NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "24"
NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "32"
NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "48"
NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree node types with shift "56"
NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "alternating bits"
NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of ten"
NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "one-every-64k"
NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "sparse"
NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
test_radixtree
----------------
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index ea993e63df..fe1e168ec4 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -19,6 +19,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -99,6 +100,8 @@ static const test_spec test_specs[] = {
}
};
+static int lwlock_tranche_id;
+
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(test_radixtree);
@@ -112,7 +115,7 @@ test_empty(void)
uint64 key;
uint64 val;
- radixtree = rt_create(CurrentMemoryContext);
+ radixtree = rt_create(CurrentMemoryContext, NULL);
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -140,17 +143,14 @@ test_empty(void)
}
static void
-test_basic(int children, bool test_inner)
+do_test_basic(radix_tree *radixtree, int children, bool test_inner)
{
- radix_tree *radixtree;
uint64 *keys;
int shift = test_inner ? 8 : 0;
elog(NOTICE, "testing basic operations with %s node %d",
test_inner ? "inner" : "leaf", children);
- radixtree = rt_create(CurrentMemoryContext);
-
/* prepare keys in order like 1, 32, 2, 31, 2, ... */
keys = palloc(sizeof(uint64) * children);
for (int i = 0; i < children; i++)
@@ -165,7 +165,7 @@ test_basic(int children, bool test_inner)
for (int i = 0; i < children; i++)
{
if (rt_set(radixtree, keys[i], keys[i]))
- elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found %d", keys[i], i);
}
/* update keys */
@@ -185,7 +185,38 @@ test_basic(int children, bool test_inner)
}
pfree(keys);
- rt_free(radixtree);
+}
+
+static void
+test_basic()
+{
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], false);
+ rt_free(tree);
+ dsa_detach(area);
+
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_basic(tree, rt_node_kind_fanouts[i], true);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
@@ -286,14 +317,10 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
* level.
*/
static void
-test_node_types(uint8 shift)
+do_test_node_types(radix_tree *radixtree, uint8 shift)
{
- radix_tree *radixtree;
-
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
- radixtree = rt_create(CurrentMemoryContext);
-
/*
* Insert and search entries for every node type at the 'shift' level,
* then delete all entries to make it empty, and insert and search entries
@@ -302,19 +329,37 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift, true);
test_node_types_delete(radixtree, shift);
test_node_types_insert(radixtree, shift, false);
+}
- rt_free(radixtree);
+static void
+test_node_types(void)
+{
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ {
+ radix_tree *tree;
+ dsa_area *area;
+
+ /* Test the local radix tree */
+ tree = rt_create(CurrentMemoryContext, NULL);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(CurrentMemoryContext, area);
+ do_test_node_types(tree, shift);
+ rt_free(tree);
+ dsa_detach(area);
+ }
}
/*
* Test with a repeating pattern, defined by the 'spec'.
*/
static void
-test_pattern(const test_spec * spec)
+do_test_pattern(radix_tree *radixtree, const test_spec * spec)
{
- radix_tree *radixtree;
rt_iter *iter;
- MemoryContext radixtree_ctx;
TimestampTz starttime;
TimestampTz endtime;
uint64 n;
@@ -340,18 +385,6 @@ test_pattern(const test_spec * spec)
pattern_values[pattern_num_values++] = i;
}
- /*
- * Allocate the radix tree.
- *
- * Allocate it in a separate memory context, so that we can print its
- * memory usage easily.
- */
- radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
- "radixtree test",
- ALLOCSET_SMALL_SIZES);
- MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
- radixtree = rt_create(radixtree_ctx);
-
/*
* Add values to the set.
*/
@@ -405,8 +438,6 @@ test_pattern(const test_spec * spec)
mem_usage = rt_memory_usage(radixtree);
fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
mem_usage, (double) mem_usage / spec->num_values);
-
- MemoryContextStats(radixtree_ctx);
}
/* Check that rt_num_entries works */
@@ -555,27 +586,57 @@ test_pattern(const test_spec * spec)
if ((nbefore - ndeleted) != nafter)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+}
+
+static void
+test_patterns(void)
+{
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ {
+ radix_tree *tree;
+ MemoryContext radixtree_ctx;
+ dsa_area *area;
+ const test_spec *spec = &test_specs[i];
- MemoryContextDelete(radixtree_ctx);
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+ /* Test the local radix tree */
+ tree = rt_create(radixtree_ctx, NULL);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ MemoryContextReset(radixtree_ctx);
+
+ /* Test the shared radix tree */
+ area = dsa_create(lwlock_tranche_id);
+ tree = rt_create(radixtree_ctx, area);
+ do_test_pattern(tree, spec);
+ rt_free(tree);
+ dsa_detach(area);
+ MemoryContextDelete(radixtree_ctx);
+ }
}
Datum
test_radixtree(PG_FUNCTION_ARGS)
{
- test_empty();
+ /* get a new lwlock tranche id for all tests for shared radix tree */
+ lwlock_tranche_id = LWLockNewTrancheId();
- for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
- {
- test_basic(rt_node_kind_fanouts[i], false);
- test_basic(rt_node_kind_fanouts[i], true);
- }
-
- for (int shift = 0; shift <= (64 - 8); shift += 8)
- test_node_types(shift);
+ test_empty();
+ test_basic();
- /* Test different test patterns, with lots of entries */
- for (int i = 0; i < lengthof(test_specs); i++)
- test_pattern(&test_specs[i]);
+ test_node_types();
+ test_patterns();
PG_RETURN_VOID();
}
--
2.31.1
Attachment: v15-0009-PoC-lazy-vacuum-integration.patch (application/octet-stream)
From 1ce76ec8644e7ce8ca1eb021c7e327f1afc11070 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 4 Nov 2022 14:14:42 +0900
Subject: [PATCH v15 9/9] PoC: lazy vacuum integration.
The patch includes:
* Introducing a new module, TIDStore, to store TID in radix tree.
* Integrating TIDStore with Lazy (parallel) vacuum.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 555 ++++++++++++++++++++++++++
src/backend/access/heap/vacuumlazy.c | 171 +++-----
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +---
src/backend/commands/vacuumparallel.c | 64 +--
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 50 +++
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/rules.out | 4 +-
13 files changed, 721 insertions(+), 235 deletions(-)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index 857beaa32d..76265974b1 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -13,6 +13,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..7e6fc4eeca
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,555 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * TID (ItemPointer) storage implementation.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "port/pg_bitutils.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+#include "miscadmin.h"
+
+/* XXX only testing purpose during development, will be removed */
+#define XXX_DEBUG_TID_STORE 1
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of 64-bit
+ * key and 64-bit value. We construct a 64-bit unsigned integer that combines
+ * the block number and the offset number. The lowest 11 bits represent the
+ * offset number, and the next 32 bits are the block number. That is, only 43
+ * bits are used:
+ *
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys could help keep the radix tree shallow.
+ *
+ * XXX: If we want to support other table AMs that want to use the full range
+ * of possible offset numbers, we'll need to change this.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits, and
+ * the remaining 37 bits are used as the key:
+ *
+ * value = bitmap representation of XXXXXX
+ * key = XXXXXYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYuu
+ *
+ * The maximum height of the radix tree is 5. The most memory-consuming case
+ * while adding TIDs is allocating the largest node in a new slab block,
+ * about 70kB. Therefore we deduct 70kB from the maximum memory.
+ */
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6 /* log(sizeof(uint64) * BITS_PER_BYTE, 2) */
+#define TIDSTORE_MEMORY_DEDUCT (1024 * 70)
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
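+/*
+ * A worked example of the encoding above (illustration only): for the TID
+ * (blkno = 7, offset = 70), the combined integer is (7 << 11) | 70 = 14406.
+ * The lowest 6 bits (14406 % 64 = 6) select the bit to set in the 64-bit
+ * value, the key is 14406 >> 6 = 225, and KEY_GET_BLKNO(225) = 225 >> 5
+ * recovers the block number 7.
+ */
+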
+struct TIDStore
+{
+ /* main storage for TID */
+ radix_tree *tree;
+
+ /* # of tids in TIDStore */
+ int num_tids;
+
+ /* maximum bytes TIDStore can consume */
+ uint64 max_bytes;
+
+ /* DSA area and handle for shared TIDStore */
+ rt_handle handle;
+ dsa_area *area;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ uint64 max_items;
+ ItemPointer itemptrs;
+ uint64 nitems;
+#endif
+};
+
+/* Iterator for TIDStore */
+typedef struct TIDStoreIter
+{
+ TIDStore *ts;
+
+ /* iterator of radix tree */
+ rt_iter *tree_iter;
+
+ /* have we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TIDStoreIterResult result;
+
+#ifdef USE_ASSERT_CHECKING
+ uint64 itemptrs_index;
+ int prev_index;
+#endif
+} TIDStoreIter;
+
+static void tidstore_iter_extract_tids(TIDStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+/*
+ * Comparator routines for use with qsort() and bsearch().
+ */
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+
+static void
+verify_iter_tids(TIDStoreIter *iter)
+{
+ uint64 index = iter->prev_index;
+ TIDStoreIterResult *result = &(iter->result);
+
+ if (iter->ts->itemptrs == NULL)
+ return;
+
+ Assert(index <= iter->ts->nitems);
+
+ for (int i = 0; i < result->num_offsets; i++)
+ {
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, result->blkno);
+ ItemPointerSetOffsetNumber(&tid, result->offsets[i]);
+
+ Assert(ItemPointerEquals(&iter->ts->itemptrs[index++], &tid));
+ }
+
+ iter->prev_index = iter->itemptrs_index;
+}
+
+static void
+dump_itemptrs(TIDStore *ts)
+{
+ StringInfoData buf;
+
+ if (ts->itemptrs == NULL)
+ return;
+
+ initStringInfo(&buf);
+ for (int i = 0; i < ts->nitems; i++)
+ {
+ appendStringInfo(&buf, "(%d,%d) ",
+ ItemPointerGetBlockNumber(&(ts->itemptrs[i])),
+ ItemPointerGetOffsetNumber(&(ts->itemptrs[i])));
+ }
+ elog(WARNING, "--- dump (" UINT64_FORMAT " items) ---", ts->nitems);
+ elog(WARNING, "%s\n", buf.data);
+}
+
+#endif
+
+/*
+ * Create a TIDStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TIDStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TIDStore *ts;
+
+ ts = palloc0(sizeof(TIDStore));
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+ ts->area = area;
+ ts->max_bytes = max_bytes - TIDSTORE_MEMORY_DEDUCT;
+
+ if (area != NULL)
+ ts->handle = rt_get_handle(ts->tree);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+#define MAXDEADITEMS(avail_mem) \
+ (avail_mem / sizeof(ItemPointerData))
+
+ if (area == NULL)
+ {
+ ts->max_items = MAXDEADITEMS(maintenance_work_mem * 1024);
+ ts->itemptrs = (ItemPointer) palloc0(sizeof(ItemPointerData) * ts->max_items);
+ ts->nitems = 0;
+ }
+#endif
+
+ return ts;
+}
+
+/* Attach to the shared TIDStore using a handle */
+TIDStore *
+tidstore_attach(dsa_area *area, rt_handle handle)
+{
+ TIDStore *ts;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ ts = palloc0(sizeof(TIDStore));
+ ts->tree = rt_attach(area, handle);
+
+ return ts;
+}
+
+/*
+ * Detach from a TIDStore. This detaches from the radix tree and frees the
+ * backend-local resources.
+ */
+void
+tidstore_detach(TIDStore *ts)
+{
+ rt_detach(ts->tree);
+ pfree(ts);
+}
+
+void
+tidstore_free(TIDStore *ts)
+{
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ pfree(ts->itemptrs);
+#endif
+
+ rt_free(ts->tree);
+ pfree(ts);
+}
+
+/* Remove all collected TIDs without freeing the TIDStore itself */
+void
+tidstore_reset(TIDStore *ts)
+{
+ dsa_area *area = ts->area;
+
+ /* Recreate the radix tree */
+ rt_free(ts->tree);
+
+ /* Return allocated DSM segments to the operating system */
+ if (ts->area)
+ dsa_trim(area);
+
+ ts->tree = rt_create(CurrentMemoryContext, area);
+
+ /* Reset the statistics */
+ ts->num_tids = 0;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ ts->nitems = 0;
+#endif
+}
+
+/* Add TIDs to TIDStore */
+void
+tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
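+ /* Flush the accumulated bitmap when the key changes */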
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ ts->num_tids++;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ {
+ if (ts->nitems >= ts->max_items)
+ {
+ ts->max_items *= 2;
+ ts->itemptrs = repalloc(ts->itemptrs, sizeof(ItemPointerData) * ts->max_items);
+ }
+
+ Assert(ts->nitems < ts->max_items);
+ ItemPointerSetBlockNumber(&(ts->itemptrs[ts->nitems]), blkno);
+ ItemPointerSetOffsetNumber(&(ts->itemptrs[ts->nitems]), offsets[i]);
+ ts->nitems++;
+ }
+#endif
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ rt_set(ts->tree, last_key, val);
+ val = 0;
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(ts->nitems == ts->num_tids);
+#endif
+}
+
+/* Return true if the given TID is present in the TIDStore */
+bool
+tidstore_lookup_tid(TIDStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ bool found_assert;
+#endif
+
+ key = tid_to_key_off(tid, &off);
+
+ found = rt_search(ts->tree, key, &val);
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ found_assert = bsearch((void *) tid,
+ (void *) ts->itemptrs,
+ ts->nitems,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr) != NULL;
+#endif
+
+ if (!found)
+ {
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ if (ts->itemptrs)
+ Assert(!found_assert);
+#endif
+ return false;
+ }
+
+ found = (val & (UINT64CONST(1) << off)) != 0;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+
+ if (ts->itemptrs && found != found_assert)
+ {
+ elog(WARNING, "tid (%d,%d)\n",
+ ItemPointerGetBlockNumber(tid),
+ ItemPointerGetOffsetNumber(tid));
+ dump_itemptrs(ts);
+ }
+
+ if (ts->itemptrs)
+ Assert(found == found_assert);
+
+#endif
+ return found;
+}
+
+/*
+ * Prepare to iterate through a TIDStore. Return the TIDStoreIter allocated
+ * in the caller's memory context.
+ */
+TIDStoreIter *
+tidstore_begin_iterate(TIDStore *ts)
+{
+ TIDStoreIter *iter;
+
+ iter = palloc0(sizeof(TIDStoreIter));
+ iter->ts = ts;
+ iter->tree_iter = rt_begin_iterate(ts->tree);
+ iter->result.blkno = InvalidBlockNumber;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index = 0;
+#endif
+
+ return iter;
+}
+
+/*
+ * Scan the TIDStore and return a TIDStoreIterResult representing the TIDs
+ * of one page. Offset numbers in the result are sorted. NULL is returned
+ * when there are no more pages to iterate over.
+ */
+TIDStoreIterResult *
+tidstore_iterate_next(TIDStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TIDStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (rt_iterate_next(iter->tree_iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair belonging to the next block so
+ * that it can be processed in the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ verify_iter_tids(iter);
+#endif
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TIDStore */
+void
+tidstore_end_iterate(TIDStoreIter *iter)
+{
+ pfree(iter);
+}
+
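+/* Return the number of TIDs stored in the TIDStore */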
+uint64
+tidstore_num_tids(TIDStore *ts)
+{
+ return ts->num_tids;
+}
+
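+/* Return true if the TIDStore's memory usage has exceeded its limit */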
+bool
+tidstore_is_full(TIDStore *ts)
+{
+ return (tidstore_memory_usage(ts) > ts->max_bytes);
+}
+
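+/* Return the maximum amount of memory the TIDStore is allowed to use */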
+uint64
+tidstore_max_memory(TIDStore *ts)
+{
+ return ts->max_bytes;
+}
+
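+/* Return the current memory usage of the TIDStore */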
+uint64
+tidstore_memory_usage(TIDStore *ts)
+{
+ return (uint64) sizeof(TIDStore) + rt_memory_usage(ts->tree);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TIDStore
+ */
+tidstore_handle
+tidstore_get_handle(TIDStore *ts)
+{
+ return rt_get_handle(ts->tree);
+}
+
+/* Extract TIDs from a key-value pair */
+static void
+tidstore_iter_extract_tids(TIDStoreIter *iter, uint64 key, uint64 val)
+{
+ TIDStoreIterResult *result = (&iter->result);
+
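+ /* Decode each bit set in the value back into an offset number */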
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+
+#if defined(USE_ASSERT_CHECKING) && defined(XXX_DEBUG_TID_STORE)
+ iter->itemptrs_index++;
+#endif
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a TID into a radix tree key. The bit position within the 64-bit
+ * value word is returned in *off.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
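+ /*
+ * The lowest TIDSTORE_VALUE_NBITS bits become the bit position within the
+ * value; the remaining upper bits become the radix tree key.
+ */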
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d59711b7ec..40082a6db0 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -194,7 +195,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TIDStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -265,8 +266,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -853,21 +855,21 @@ lazy_scan_heap(LVRelState *vacrel)
next_unskippable_block,
next_failsafe_block = 0,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TIDStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -937,8 +939,8 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ /* XXX: should not allow tidstore to grow beyond max_bytes */
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1070,11 +1072,18 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TIDStoreIter *iter;
+ TIDStoreIterResult *result;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1111,7 +1120,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1264,7 +1273,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1863,25 +1872,16 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
Assert(!prunestate->all_visible);
Assert(prunestate->has_lpdead_items);
vacrel->lpdead_item_pages++;
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/* Finally, add page-local counts to whole-VACUUM counts */
@@ -2088,8 +2088,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TIDStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2098,17 +2097,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2157,7 +2149,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2186,7 +2178,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2213,8 +2205,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2259,7 +2251,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2331,7 +2323,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2368,10 +2360,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TIDStoreIter *iter;
+ TIDStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2388,8 +2381,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber tblk;
Buffer buf;
@@ -2398,12 +2391,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = result->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2413,6 +2407,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, tblk, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
/* Clear the block number information */
vacrel->blkno = InvalidBlockNumber;
@@ -2427,14 +2422,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT " dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2451,11 +2445,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2474,16 +2467,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2563,7 +2551,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3065,46 +3052,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3115,11 +3062,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3146,7 +3091,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3159,11 +3104,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..bc42144f08 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 293b84bbca..7f5776fbf8 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2276,16 +2275,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TIDStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2316,18 +2315,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2338,60 +2325,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TIDStore *dead_items = (TIDStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index f26d796e52..429607d5fa 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TIDStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_free(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TIDStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TIDStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 528b2e9643..ea8cf6283b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -186,6 +186,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..3afc7612ae
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,50 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * TID storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TIDStore TIDStore;
+typedef struct TIDStoreIter TIDStoreIter;
+
+typedef struct TIDStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually only partially used */
+ int num_offsets;
+} TIDStoreIterResult;
+
+extern TIDStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TIDStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TIDStore *ts);
+extern void tidstore_free(TIDStore *ts);
+extern void tidstore_reset(TIDStore *ts);
+extern void tidstore_add_tids(TIDStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TIDStore *ts, ItemPointer tid);
+extern TIDStoreIter *tidstore_begin_iterate(TIDStore *ts);
+extern TIDStoreIterResult *tidstore_iterate_next(TIDStoreIter *iter);
+extern void tidstore_end_iterate(TIDStoreIter *iter);
+extern uint64 tidstore_num_tids(TIDStore *ts);
+extern bool tidstore_is_full(TIDStore *ts);
+extern uint64 tidstore_max_memory(TIDStore *ts);
+extern uint64 tidstore_memory_usage(TIDStore *ts);
+extern tidstore_handle tidstore_get_handle(TIDStore *ts);
+
+#endif /* TIDSTORE_H */
+
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index a28938caf4..75d540d315 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 4e4bc26a8b..afe61c21fd 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -235,21 +236,6 @@ typedef struct VacuumParams
int nworkers;
} VacuumParams;
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -302,18 +288,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TIDStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TIDStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index dd818e16ab..f1e0bcede5 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -204,6 +204,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..0c49354f04 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT s.stats_reset,
--
2.31.1
Attachment: v15-0006-Use-rt_node_ptr-to-reference-radix-tree-nodes.patch (application/octet-stream)
From 7e5fd8a19adb0305f77618231364eacaa2e0a59a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 14 Nov 2022 11:44:17 +0900
Subject: [PATCH v15 6/9] Use rt_node_ptr to reference radix tree nodes.
---
src/backend/lib/radixtree.c | 688 +++++++++++++++++++++---------------
1 file changed, 398 insertions(+), 290 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index abd0450727..bff37a2c35 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -150,6 +150,19 @@ typedef enum rt_size_class
#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
} rt_size_class;
+/*
+ * rt_pointer is a pointer representation that works both for nodes in backend
+ * local memory and for nodes in a DSA area (i.e. dsa_pointer). Since radix
+ * tree nodes can be allocated in backend local memory as well as in a DSA
+ * area, inner nodes cannot store plain C pointers to rt_node (i.e. backend
+ * local memory addresses) as child pointers; they store rt_pointer values
+ * instead. The backend local memory address of a node can be obtained from
+ * an rt_pointer with rt_pointer_decode().
+ */
+typedef uintptr_t rt_pointer;
+#define InvalidRTPointer ((rt_pointer) 0)
+#define RTPointerIsValid(x) (((rt_pointer) (x)) != InvalidRTPointer)
+
/* Common type for all nodes types */
typedef struct rt_node
{
@@ -175,8 +188,7 @@ typedef struct rt_node
/* Node kind, one per search/set algorithm */
uint8 kind;
} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define RT_NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
@@ -240,7 +252,7 @@ typedef struct rt_node_inner_4
rt_node_base_4 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_4;
typedef struct rt_node_leaf_4
@@ -256,7 +268,7 @@ typedef struct rt_node_inner_32
rt_node_base_32 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_32;
typedef struct rt_node_leaf_32
@@ -272,7 +284,7 @@ typedef struct rt_node_inner_125
rt_node_base_125 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+ rt_pointer children[FLEXIBLE_ARRAY_MEMBER];
} rt_node_inner_125;
typedef struct rt_node_leaf_125
@@ -292,7 +304,7 @@ typedef struct rt_node_inner_256
rt_node_base_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
+ rt_pointer children[RT_NODE_MAX_SLOTS];
} rt_node_inner_256;
typedef struct rt_node_leaf_256
@@ -306,6 +318,29 @@ typedef struct rt_node_leaf_256
uint64 values[RT_NODE_MAX_SLOTS];
} rt_node_leaf_256;
+/* rt_node_ptr is a data structure representing a pointer to an rt_node */
+typedef struct rt_node_ptr
+{
+ rt_pointer encoded;
+ rt_node *decoded;
+} rt_node_ptr;
+#define InvalidRTNodePtr \
+ (rt_node_ptr) {.encoded = InvalidRTPointer, .decoded = NULL}
+#define RTNodePtrIsValid(n) \
+ (!rt_node_ptr_eq((rt_node_ptr *) &(n), &(InvalidRTNodePtr)))
+
+/* Macros for rt_node_ptr to access the fields of rt_node */
+#define NODE_RAW(n) (n.decoded)
+#define NODE_IS_LEAF(n) (NODE_RAW(n)->shift == 0)
+#define NODE_IS_EMPTY(n) (NODE_COUNT(n) == 0)
+#define NODE_KIND(n) (NODE_RAW(n)->kind)
+#define NODE_COUNT(n) (NODE_RAW(n)->count)
+#define NODE_SHIFT(n) (NODE_RAW(n)->shift)
+#define NODE_CHUNK(n) (NODE_RAW(n)->chunk)
+#define NODE_FANOUT(n) (NODE_RAW(n)->fanout)
+#define NODE_HAS_FREE_SLOT(n) \
+ (NODE_COUNT(n) < rt_node_kind_info[NODE_KIND(n)].fanout)
+
/* Information for each size class */
typedef struct rt_size_class_elem
{
@@ -394,7 +429,7 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
*/
typedef struct rt_node_iter
{
- rt_node *node; /* current node being iterated */
+ rt_node_ptr node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
} rt_node_iter;
@@ -415,7 +450,7 @@ struct radix_tree
{
MemoryContext context;
- rt_node *root;
+ rt_pointer root;
uint64 max_val;
uint64 num_keys;
@@ -429,27 +464,58 @@ struct radix_tree
};
static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
-static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+
+static rt_node_ptr rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class,
bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_free_node(radix_tree *tree, rt_node_ptr node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+static inline bool rt_node_search_inner(rt_node_ptr node_ptr, uint64 key, rt_action action,
+ rt_pointer *child_p);
+static inline bool rt_node_search_leaf(rt_node_ptr node_ptr, uint64 key, rt_action action,
uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+static bool rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ rt_node_ptr *child_p);
static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static void rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from);
static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
+static void rt_verify_node(rt_node_ptr node);
+
+/* Decode and encode functions of rt_pointer */
+static inline rt_node *
+rt_pointer_decode(rt_pointer encoded)
+{
+ return (rt_node *) encoded;
+}
+
+static inline rt_pointer
+rt_pointer_encode(rt_node *decoded)
+{
+ return (rt_pointer) decoded;
+}
+
+/* Return a rt_node_ptr created from the given encoded pointer */
+static inline rt_node_ptr
+rt_node_ptr_encoded(rt_pointer encoded)
+{
+ return (rt_node_ptr) {
+ .encoded = encoded,
+ .decoded = rt_pointer_decode(encoded),
+ };
+}
+
+static inline bool
+rt_node_ptr_eq(rt_node_ptr *a, rt_node_ptr *b)
+{
+ return (a->decoded == b->decoded) && (a->encoded == b->encoded);
+}
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
@@ -598,10 +664,10 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_shift(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_pointer) * (count - idx));
}
static inline void
@@ -613,10 +679,10 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+chunk_children_array_delete(uint8 *chunks, rt_pointer *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_pointer) * (count - idx - 1));
}
static inline void
@@ -628,12 +694,12 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children)
+chunk_children_array_copy(uint8 *src_chunks, rt_pointer *src_children,
+ uint8 *dst_chunks, rt_pointer *dst_children)
{
const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size children_size = sizeof(rt_node *) * fanout;
+ const Size children_size = sizeof(rt_pointer) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_children, src_children, children_size);
@@ -665,7 +731,7 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
@@ -673,23 +739,23 @@ node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
static inline bool
node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
#endif
-static inline rt_node *
+static inline rt_pointer
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline uint64
node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -699,9 +765,9 @@ node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
- node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->children[node->base.slot_idxs[chunk]] = InvalidRTPointer;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -710,7 +776,7 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -742,11 +808,11 @@ node_125_find_unused_slot(bitmapword *isset)
}
static inline void
-node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
int slotpos;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -761,7 +827,7 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
int slotpos;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
@@ -772,16 +838,16 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
/* Update the child corresponding to 'chunk' to 'child' */
static inline void
-node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[node->base.slot_idxs[chunk]] = child;
}
static inline void
node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->values[node->base.slot_idxs[chunk]] = value;
}
@@ -791,21 +857,21 @@ node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
static inline bool
node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ Assert(!RT_NODE_IS_LEAF(node));
+ return RTPointerIsValid(node->children[chunk]);
}
static inline bool
node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
}
-static inline rt_node *
+static inline rt_pointer
node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(node_inner_256_is_chunk_used(node, chunk));
return node->children[chunk];
}
@@ -813,16 +879,16 @@ node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
static inline uint64
node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(node_leaf_256_is_chunk_used(node, chunk));
return node->values[chunk];
}
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_pointer child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -830,7 +896,7 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
static inline void
node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
node->values[chunk] = value;
}
@@ -839,14 +905,14 @@ node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
static inline void
node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = InvalidRTPointer;
}
static inline void
node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
}
@@ -882,29 +948,32 @@ rt_new_root(radix_tree *tree, uint64 key)
{
int shift = key_get_shift(key);
bool inner = shift > 0;
- rt_node *newnode;
+ rt_node_ptr newnode;
newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newnode->shift = shift;
+ NODE_SHIFT(newnode) = shift;
+
tree->max_val = shift_get_max_val(shift);
- tree->root = newnode;
+ tree->root = newnode.encoded;
}
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
+static rt_node_ptr
rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
else
- newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ newnode.decoded = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+ newnode.encoded = rt_pointer_encode(newnode.decoded);
#ifdef RT_DEBUG
/* update the statistics */
@@ -916,20 +985,20 @@ rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
/* Initialize the node contents */
static inline void
-rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+rt_init_node(rt_node_ptr node, uint8 kind, rt_size_class size_class, bool inner)
{
if (inner)
- MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].inner_size);
else
- MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+ MemSet(node.decoded, 0, rt_size_class_info[size_class].leaf_size);
- node->kind = kind;
- node->fanout = rt_size_class_info[size_class].fanout;
+ NODE_KIND(node) = kind;
+ NODE_FANOUT(node) = rt_size_class_info[size_class].fanout;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_125)
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
}
@@ -939,25 +1008,25 @@ rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
* and this is the max size class to it will never grow.
*/
if (kind == RT_NODE_KIND_256)
- node->fanout = 0;
+ NODE_FANOUT(node) = 0;
}
static inline void
-rt_copy_node(rt_node *newnode, rt_node *oldnode)
+rt_copy_node(rt_node_ptr newnode, rt_node_ptr oldnode)
{
- newnode->shift = oldnode->shift;
- newnode->chunk = oldnode->chunk;
- newnode->count = oldnode->count;
+ NODE_SHIFT(newnode) = NODE_SHIFT(oldnode);
+ NODE_CHUNK(newnode) = NODE_CHUNK(oldnode);
+ NODE_COUNT(newnode) = NODE_COUNT(oldnode);
}
/*
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node*
-rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+static rt_node_ptr
+rt_grow_node_kind(radix_tree *tree, rt_node_ptr node, uint8 new_kind)
{
- rt_node *newnode;
+ rt_node_ptr newnode;
bool inner = !NODE_IS_LEAF(node);
newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
@@ -969,12 +1038,12 @@ rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+rt_free_node(radix_tree *tree, rt_node_ptr node)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
+ if (tree->root == node.encoded)
{
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
}
@@ -985,7 +1054,7 @@ rt_free_node(radix_tree *tree, rt_node *node)
/* update the statistics */
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
- if (node->fanout == rt_size_class_info[i].fanout)
+ if (NODE_FANOUT(node) == rt_size_class_info[i].fanout)
break;
}
@@ -998,29 +1067,30 @@ rt_free_node(radix_tree *tree, rt_node *node)
}
#endif
- pfree(node);
+ pfree(node.decoded);
}
/*
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+rt_replace_node(radix_tree *tree, rt_node_ptr parent, rt_node_ptr old_child,
+ rt_node_ptr new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ Assert(NODE_CHUNK(old_child) == NODE_CHUNK(new_child));
+ Assert(NODE_SHIFT(old_child) == NODE_SHIFT(new_child));
- if (parent == old_child)
+ if (rt_node_ptr_eq(&parent, &old_child))
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->root = new_child.encoded;
}
else
{
bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ replaced = rt_node_insert_inner(tree, InvalidRTNodePtr, parent, key,
+ new_child);
Assert(replaced);
}
@@ -1035,24 +1105,28 @@ static void
rt_extend(radix_tree *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ rt_node *root = rt_pointer_decode(tree->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = key_get_shift(key);
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- rt_node_inner_4 *node;
+ rt_node_ptr node;
+ rt_node_inner_4 *n4;
+
+ node = rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
- rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node->base.n.shift = shift;
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ n4 = (rt_node_inner_4 *) node.decoded;
+ n4->base.n.shift = shift;
+ n4->base.n.count = 1;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ root->chunk = 0;
+ tree->root = node.encoded;
shift += RT_NODE_SPAN;
}
@@ -1065,21 +1139,22 @@ rt_extend(radix_tree *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node_ptr parent,
+ rt_node_ptr node)
{
- int shift = node->shift;
+ int shift = NODE_SHIFT(node);
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ rt_node_ptr newchild;
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newchild->shift = newshift;
- newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ NODE_SHIFT(newchild) = newshift;
+ NODE_CHUNK(newchild) = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
+
rt_node_insert_inner(tree, parent, node, key, newchild);
parent = node;
@@ -1099,17 +1174,18 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node_ptr node, uint64 key, rt_action action,
+ rt_pointer *child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
- rt_node *child = NULL;
+ rt_pointer child;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1127,7 +1203,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1143,7 +1219,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1159,7 +1235,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, chunk))
break;
@@ -1176,7 +1252,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && child_p)
*child_p = child;
@@ -1192,17 +1268,17 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node_ptr node, uint64 key, rt_action action, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool found = false;
uint64 value = 0;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
@@ -1220,7 +1296,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
@@ -1236,7 +1312,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
break;
@@ -1252,7 +1328,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, chunk))
break;
@@ -1269,7 +1345,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* update statistics */
if (action == RT_ACTION_DELETE && found)
- node->count--;
+ NODE_COUNT(node)--;
if (found && value_p)
*value_p = value;
@@ -1279,19 +1355,19 @@ rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+rt_node_insert_inner(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
+ uint64 key, rt_node_ptr child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(!NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1299,25 +1375,27 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n4->children[idx] = child;
+ n4->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) new.decoded;
+
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
- Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ Assert(RTNodePtrIsValid(parent));
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1330,14 +1408,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
count, insertpos);
n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
+ n4->children[insertpos] = child.encoded;
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1345,45 +1423,52 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
/* found the existing chunk */
chunk_exists = true;
- n32->children[idx] = child;
+ n32->children[idx] = child.encoded;
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_inner_32 *new32;
- new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ new32 = (rt_node_inner_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_inner_32;
}
else
{
+ rt_node_ptr new;
rt_node_inner_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_inner_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
@@ -1398,7 +1483,7 @@ retry_insert_inner_32:
count, insertpos);
n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
+ n32->children[insertpos] = child.encoded;
break;
}
}
@@ -1406,25 +1491,28 @@ retry_insert_inner_32:
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
{
/* found the existing chunk */
chunk_exists = true;
- node_inner_125_update(n125, chunk, child);
+ node_inner_125_update(n125, chunk, child.encoded);
break;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_inner_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) new.decoded;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1434,32 +1522,31 @@ retry_insert_inner_32:
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
- node_inner_125_insert(n125, chunk, child);
+ node_inner_125_insert(n125, chunk, child.encoded);
break;
}
}
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
- node_inner_256_set(n256, chunk, child);
+ node_inner_256_set(n256, chunk, child.encoded);
break;
}
}
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1472,19 +1559,19 @@ retry_insert_inner_32:
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+rt_node_insert_leaf(radix_tree *tree, rt_node_ptr parent, rt_node_ptr node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ uint8 chunk = RT_GET_KEY_CHUNK(key, NODE_SHIFT(node));
bool chunk_exists = false;
Assert(NODE_IS_LEAF(node));
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
int idx;
idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
@@ -1498,16 +1585,18 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_leaf_32 *) new.decoded;
chunk_values_array_copy(n4->base.chunks, n4->values,
new32->base.chunks, new32->values);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
- node = (rt_node *) new32;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1527,7 +1616,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
int idx;
idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
@@ -1541,45 +1630,51 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
- Assert(parent != NULL);
+ Assert(RTNodePtrIsValid(parent));
if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
{
/* use the same node kind, but expand to the next size class */
const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_ptr new;
rt_node_leaf_32 *new32;
- new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new = rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ new32 = (rt_node_leaf_32 *) new.decoded;
memcpy(new32, n32, size);
new32->base.n.fanout = fanout;
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, new, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
+ /*
+ * Must update both pointers here since we update n32 and
+ * verify node.
+ */
+ node = new;
n32 = new32;
goto retry_insert_leaf_32;
}
else
{
+ rt_node_ptr new;
rt_node_leaf_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_leaf_125 *) new.decoded;
+
for (int i = 0; i < n32->base.n.count; i++)
node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
- key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
}
else
{
- retry_insert_leaf_32:
+retry_insert_leaf_32:
{
int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
int count = n32->base.n.count;
@@ -1597,7 +1692,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int cnt = 0;
if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
@@ -1610,12 +1705,14 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ rt_node_ptr new;
rt_node_leaf_256 *new256;
- Assert(parent != NULL);
+
+ Assert(RTNodePtrIsValid(parent));
/* grow node from 125 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ new = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_leaf_256 *) new.decoded;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
@@ -1625,9 +1722,8 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, new, key);
+ node = new;
}
else
{
@@ -1638,7 +1734,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* FALLTHROUGH */
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
@@ -1650,7 +1746,7 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/* Update statistics */
if (!chunk_exists)
- node->count++;
+ NODE_COUNT(node)++;
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -1674,7 +1770,7 @@ rt_create(MemoryContext ctx)
tree = palloc(sizeof(radix_tree));
tree->context = ctx;
- tree->root = NULL;
+ tree->root = InvalidRTPointer;
tree->max_val = 0;
tree->num_keys = 0;
@@ -1723,26 +1819,23 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- rt_node *node;
- rt_node *parent;
+ rt_node_ptr node;
+ rt_node_ptr parent;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
rt_new_root(tree, key);
/* Extend the tree if necessary */
if (key > tree->max_val)
rt_extend(tree, key);
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = parent = tree->root;
-
/* Descend the tree until a leaf node */
+ node = parent = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1754,7 +1847,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
}
parent = node;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1775,21 +1868,21 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
bool
rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RTPointerIsValid(tree->root) || key > tree->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
if (NODE_IS_LEAF(node))
break;
@@ -1797,7 +1890,7 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1811,8 +1904,8 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
bool
rt_delete(radix_tree *tree, uint64 key)
{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ rt_node_ptr node;
+ rt_node_ptr stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1824,12 +1917,12 @@ rt_delete(radix_tree *tree, uint64 key)
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
level = -1;
while (shift > 0)
{
- rt_node *child;
+ rt_pointer child;
/* Push the current node to the stack */
stack[++level] = node;
@@ -1837,7 +1930,7 @@ rt_delete(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
return false;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
}
@@ -1888,6 +1981,7 @@ rt_iter *
rt_begin_iterate(radix_tree *tree)
{
MemoryContext old_ctx;
+ rt_node_ptr root;
rt_iter *iter;
int top_level;
@@ -1897,17 +1991,18 @@ rt_begin_iterate(radix_tree *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree->root)
+ if (!RTPointerIsValid(iter->tree) || !RTPointerIsValid(iter->tree->root))
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = rt_node_ptr_encoded(iter->tree->root);
+ top_level = NODE_SHIFT(root) / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ rt_update_iter_stack(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1918,14 +2013,15 @@ rt_begin_iterate(radix_tree *tree)
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+rt_update_iter_stack(rt_iter *iter, rt_node_ptr from_node, int from)
{
int level = from;
- rt_node *node = from_node;
+ rt_node_ptr node = from_node;
for (;;)
{
rt_node_iter *node_iter = &(iter->stack[level--]);
+ bool found PG_USED_FOR_ASSERTS_ONLY;
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1935,10 +2031,10 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
return;
/* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
+ found = rt_node_inner_iterate_next(iter, node_iter, &node);
/* We must find the first children in the node */
- Assert(node);
+ Assert(found);
}
}
@@ -1955,7 +2051,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- rt_node *child = NULL;
+ rt_node_ptr child = InvalidRTNodePtr;
uint64 value;
int level;
bool found;
@@ -1976,14 +2072,12 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
*/
for (level = 1; level <= iter->stack_len; level++)
{
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
-
- if (child)
+ if (rt_node_inner_iterate_next(iter, &(iter->stack[level]), &child))
break;
}
/* the iteration finished */
- if (!child)
+ if (!RTNodePtrIsValid(child))
return false;
/*
@@ -2015,18 +2109,19 @@ rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
* Advance the slot in the inner node. Return the child if exists, otherwise
* null.
*/
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+static inline bool
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter, rt_node_ptr *child_p)
{
- rt_node *child = NULL;
+ rt_node_ptr node = node_iter->node;
+ rt_pointer child;
bool found = false;
uint8 key_chunk;
- switch (node_iter->node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2039,7 +2134,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_32:
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2052,7 +2147,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_125:
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2072,7 +2167,7 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
case RT_NODE_KIND_256:
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2093,9 +2188,12 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
}
if (found)
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ {
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
+ *child_p = rt_node_ptr_encoded(child);
+ }
- return child;
+ return found;
}
/*
@@ -2103,19 +2201,18 @@ rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
* is set to value_p, otherwise return false.
*/
static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter, uint64 *value_p)
{
- rt_node *node = node_iter->node;
+ rt_node_ptr node = node_iter->node;
bool found = false;
uint64 value;
uint8 key_chunk;
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n4->base.n.count)
@@ -2128,7 +2225,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_32:
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
node_iter->current_idx++;
if (node_iter->current_idx >= n32->base.n.count)
@@ -2141,7 +2238,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_125:
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2161,7 +2258,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
}
case RT_NODE_KIND_256:
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int i;
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
@@ -2183,7 +2280,7 @@ rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
if (found)
{
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ rt_iter_update_key(iter, key_chunk, NODE_SHIFT(node));
*value_p = value;
}
@@ -2220,16 +2317,16 @@ rt_memory_usage(radix_tree *tree)
* Verify the radix tree node.
*/
static void
-rt_verify_node(rt_node *node)
+rt_verify_node(rt_node_ptr node)
{
#ifdef USE_ASSERT_CHECKING
- Assert(node->count >= 0);
+ Assert(NODE_COUNT(node) >= 0);
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node.decoded;
for (int i = 1; i < n4->n.count; i++)
Assert(n4->chunks[i - 1] < n4->chunks[i]);
@@ -2238,7 +2335,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_32:
{
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node.decoded;
for (int i = 1; i < n32->n.count; i++)
Assert(n32->chunks[i - 1] < n32->chunks[i]);
@@ -2247,7 +2344,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2257,10 +2354,10 @@ rt_verify_node(rt_node *node)
/* Check if the corresponding slot is used */
if (NODE_IS_LEAF(node))
- Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) n125,
n125->slot_idxs[i]));
else
- Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) n125,
n125->slot_idxs[i]));
cnt++;
@@ -2273,7 +2370,7 @@ rt_verify_node(rt_node *node)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
int cnt = 0;
for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
@@ -2294,54 +2391,62 @@ rt_verify_node(rt_node *node)
void
rt_stats(radix_tree *tree)
{
+ rt_node *root = rt_pointer_decode(tree->root);
+
+ if (root == NULL)
+ return;
+
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->num_keys,
+ root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(rt_node_ptr node, int level, bool recurse)
{
- char space[125] = {0};
+ rt_node *n = node.decoded;
+ char space[128] = {0};
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_125) ? 125 : 256,
- node->fanout == 0 ? 256 : node->fanout,
- node->count, node->shift, node->chunk);
+ (n->kind == RT_NODE_KIND_4) ? 4 :
+ (n->kind == RT_NODE_KIND_32) ? 32 :
+ (n->kind == RT_NODE_KIND_125) ? 125 : 256,
+ n->fanout == 0 ? 256 : n->fanout,
+ n->count, n->shift, n->chunk);
if (level > 0)
sprintf(space, "%*c", level * 4, ' ');
- switch (node->kind)
+ switch (NODE_KIND(node))
{
case RT_NODE_KIND_4:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n4->base.chunks[i], n4->values[i]);
}
else
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n4->children[i]),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2350,25 +2455,26 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_32:
{
- for (int i = 0; i < node->count; i++)
+ for (int i = 0; i < NODE_COUNT(node); i++)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n32->base.chunks[i], n32->values[i]);
}
else
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node.decoded;
fprintf(stderr, "%schunk 0x%X ->",
space, n32->base.chunks[i]);
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ rt_dump_node(rt_node_ptr_encoded(n32->children[i]),
+ level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2378,7 +2484,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node.decoded;
fprintf(stderr, "slot_idxs ");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -2390,7 +2496,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node.decoded;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < WORDNUM(128); i++)
@@ -2420,7 +2526,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_125_get_child(n125, i),
+ rt_dump_node(rt_node_ptr_encoded(node_inner_125_get_child(n125, i)),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2434,7 +2540,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node.decoded;
if (!node_leaf_256_is_chunk_used(n256, i))
continue;
@@ -2444,7 +2550,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
else
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node.decoded;
if (!node_inner_256_is_chunk_used(n256, i))
continue;
@@ -2453,8 +2559,8 @@ rt_dump_node(rt_node *node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
+ rt_dump_node(rt_node_ptr_encoded(node_inner_256_get_child(n256, i)),
+ level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2467,7 +2573,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
void
rt_dump_search(radix_tree *tree, uint64 key)
{
- rt_node *node;
+ rt_node_ptr node;
int shift;
int level = 0;
@@ -2475,7 +2581,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
tree->max_val, tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
elog(NOTICE, "tree is empty");
return;
@@ -2488,11 +2594,11 @@ rt_dump_search(radix_tree *tree, uint64 key)
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = rt_node_ptr_encoded(tree->root);
+ shift = NODE_SHIFT(node);
while (shift >= 0)
{
- rt_node *child;
+ rt_pointer child;
rt_dump_node(node, level, false);
@@ -2509,7 +2615,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
break;
- node = child;
+ node = rt_node_ptr_encoded(child);
shift -= RT_NODE_SPAN;
level++;
}
@@ -2518,6 +2624,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
void
rt_dump(radix_tree *tree)
{
+ rt_node_ptr root;
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
@@ -2528,12 +2635,13 @@ rt_dump(radix_tree *tree)
rt_size_class_info[i].leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
- if (!tree->root)
+ if (!RTPointerIsValid(tree->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ root = rt_node_ptr_encoded(tree->root);
+ rt_dump_node(root, 0, true);
}
#endif
--
2.31.1
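To make the conversion above easier to follow: the patch replaces bare rt_node pointers with an rt_node_ptr that carries both an encoded rt_pointer (the form stored in parent slots and in tree->root) and the locally decoded address, presumably so the encoded form need not stay a raw C pointer. The definitions of rt_node_ptr and the NODE_* accessors are not part of this excerpt; the following is only a sketch of the shape the call sites appear to assume (local-memory case, names taken from the diff), not the patch's actual code:

/* sketch only -- reconstructed from usage in the hunks above */
typedef uintptr_t rt_pointer;       /* encoded form, stored in nodes and tree->root */

#define InvalidRTPointer        ((rt_pointer) 0)
#define RTPointerIsValid(x)     ((x) != InvalidRTPointer)

typedef struct rt_node_ptr
{
    rt_pointer  encoded;            /* value to store into parent slots */
    rt_node    *decoded;            /* locally dereferenceable address */
} rt_node_ptr;

#define InvalidRTNodePtr \
    ((rt_node_ptr) {.encoded = InvalidRTPointer, .decoded = NULL})
#define RTNodePtrIsValid(p)     (RTPointerIsValid((p).encoded))

/* header field accessors used throughout the hunks above */
#define NODE_KIND(p)        ((p).decoded->kind)
#define NODE_COUNT(p)       ((p).decoded->count)
#define NODE_SHIFT(p)       ((p).decoded->shift)
#define NODE_CHUNK(p)       ((p).decoded->chunk)
#define NODE_FANOUT(p)      ((p).decoded->fanout)

/* in backend-local memory the encoding is just a cast */
static inline rt_node *
rt_pointer_decode(rt_pointer encoded)
{
    return (rt_node *) encoded;
}

static inline rt_node_ptr
rt_node_ptr_encoded(rt_pointer encoded)
{
    return (rt_node_ptr) {.encoded = encoded,
                          .decoded = rt_pointer_decode(encoded)};
}

With that shape in mind, the mechanical changes in the diff (node.decoded casts, node.encoded stored into children[], rt_node_ptr_encoded() when following a child) read naturally.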
Attachment: v15-0004-Use-bitmapword-for-node-125.patch (application/octet-stream)
From 066eada2c94025a273fa0e49763c6817fcc1906a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 15:22:26 +0700
Subject: [PATCH v15 4/9] Use bitmapword for node-125
TODO: Rename macros copied from bitmapset.c
---
src/backend/lib/radixtree.c | 70 ++++++++++++++++++-------------------
1 file changed, 34 insertions(+), 36 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index e7f61fd943..abd0450727 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -62,6 +62,7 @@
#include "lib/radixtree.h"
#include "lib/stringinfo.h"
#include "miscadmin.h"
+#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
#include "utils/memutils.h"
@@ -103,6 +104,10 @@
#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+/* FIXME rename */
+#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
+
/* Enum used by rt_node_search() */
typedef enum
{
@@ -207,6 +212,9 @@ typedef struct rt_node_base125
/* The index of slots for each fanout */
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[WORDNUM(128)];
} rt_node_base_125;
typedef struct rt_node_base256
@@ -271,9 +279,6 @@ typedef struct rt_node_leaf_125
{
rt_node_base_125 base;
- /* isset is a bitmap to track which slot is in use */
- uint8 isset[RT_NODE_NSLOTS_BITS(128)];
-
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_125;
@@ -655,13 +660,14 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
}
+#ifdef USE_ASSERT_CHECKING
/* Is the slot in the node used? */
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return (node->children[slot] != NULL);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
static inline bool
@@ -669,8 +675,9 @@ node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
Assert(NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
+#endif
static inline rt_node *
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
@@ -690,7 +697,10 @@ node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
static void
node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
+ int slotpos = node->base.slot_idxs[chunk];
+
Assert(!NODE_IS_LEAF(node));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->children[node->base.slot_idxs[chunk]] = NULL;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -701,44 +711,35 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
int slotpos = node->base.slot_idxs[chunk];
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
/* Return an unused slot in node-125 */
static int
-node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
-{
- int slotpos = 0;
-
- Assert(!NODE_IS_LEAF(node));
- while (node_inner_125_is_slot_used(node, slotpos))
- slotpos++;
-
- return slotpos;
-}
-
-static int
-node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+node_125_find_unused_slot(bitmapword *isset)
{
int slotpos;
+ int idx;
+ bitmapword inverse;
- Assert(NODE_IS_LEAF(node));
-
- /* We iterate over the isset bitmap per byte then check each bit */
- for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < WORDNUM(128); idx++)
{
- if (node->isset[slotpos] < 0xFF)
+ if (isset[idx] < ~((bitmapword) 0))
break;
}
- Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
- slotpos *= BITS_PER_BYTE;
- while (node_leaf_125_is_slot_used(node, slotpos))
- slotpos++;
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+
+ /* mark the slot used */
+ isset[idx] |= bmw_rightmost_one(inverse);
return slotpos;
-}
+}
static inline void
node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
@@ -747,8 +748,7 @@ node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
Assert(!NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_inner_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
@@ -763,12 +763,10 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
Assert(NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
node->values[slotpos] = value;
}
@@ -2395,9 +2393,9 @@ rt_dump_node(rt_node *node, int level, bool recurse)
rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < 16; i++)
+ for (int i = 0; i < WORDNUM(128); i++)
{
- fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ fprintf(stderr, UINT64_FORMAT_HEX " ", n->base.isset[i]);
}
fprintf(stderr, "\n");
}
--
2.31.1
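The rewritten node_125_find_unused_slot() above depends on bmw_rightmost_one_pos() and bmw_rightmost_one(), which are not shown in this patch. Assuming they are the usual lowest-set-bit helpers, the slot search boils down to the standalone sketch below (plain uint64 words and a GCC builtin stand in for bitmapword and the bmw_* macros; as in the patch, the caller must already have checked that a free slot exists):

#include <stdint.h>

/* find and mark the lowest clear bit across an array of bitmap words */
static int
find_unused_slot(uint64_t *isset, int nwords)
{
    int         idx;
    uint64_t    inverse;

    /* find the first word that still has at least one clear bit */
    for (idx = 0; idx < nwords; idx++)
    {
        if (isset[idx] != UINT64_MAX)
            break;
    }

    /* the lowest set bit of ~X is the lowest clear bit of X */
    inverse = ~isset[idx];

    /* mark the slot used, as the patch version does before returning */
    isset[idx] |= inverse & (~inverse + 1);

    return idx * 64 + (int) __builtin_ctzll(inverse);
}

For a 125-slot node the bitmap is WORDNUM(128) words, so on 64-bit builds the loop runs at most twice.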
Attachment: v15-0001-introduce-vector8_min-and-vector8_highbit_mask.patch (application/octet-stream)
From ceaf56be51d2c686a795e1ab1ab40f701ed21d62 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v15 1/9] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
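vector8_highbit_mask() is what lets node_32_search_eq() (added in the 0003 patch below; its SIMD branch is cut off at the end of this excerpt) turn a byte-wise comparison into an index. A minimal sketch of that idiom follows; it ignores the USE_NO_SIMD fallback, assumes a 16-byte Vector8, and is not the patch's actual code. A real caller must additionally reject matches at positions beyond the node's count, since trailing bytes are not valid chunks.

#include "postgres.h"
#include "port/pg_bitutils.h"
#include "port/simd.h"

static inline int
first_match_index(const uint8 *chunks, uint8 chunk)
{
    Vector8     haystack;
    Vector8     cmp;
    uint32      bitfield;

    vector8_load(&haystack, chunks);

    /* bytes equal to 'chunk' become 0xFF, i.e. their high bits are set */
    cmp = vector8_eq(haystack, vector8_broadcast(chunk));

    /* one bit per byte; bit i is set iff chunks[i] matched */
    bitfield = vector8_highbit_mask(cmp);

    if (bitfield == 0)
        return -1;

    /* position of the first matching byte */
    return pg_rightmost_one_pos32(bitfield);
}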
Attachment: v15-0003-Add-radix-implementation.patch (application/octet-stream)
From 6ba6c9979b2bd4fb5ef3c61d7a6edac1737e8509 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v15 3/9] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2541 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 581 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3291 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 48da1bddce..4303d306cd 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -9,4 +9,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..e7f61fd943
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2541 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports a fixed key length, so we don't expect the tree level to
+ * be high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner tree nodes
+ * (shift > 0) store the pointer to the child node as the value, while leaf nodes
+ * (shift == 0) store the 64-bit unsigned integer that is specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, therefore there is duplicated code. While this sometimes makes
+ * code maintenance tricky, it reduces branch prediction misses when judging
+ * whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * and creates memory contexts for each kind of radix tree node under it.
+ *
+ * rt_iterate_next() returns key-value pairs in the ascending
+ * order of the key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bits required to represent nslots slots, used
+ * nodes indexed by array lookup.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from the value to the bit in is-set bitmap in the node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds and each node kind have one or two size classes,
+ * partial and full. The size classes in the same node kind have the same
+ * node structure but have the different number of fanout that is stored
+ * in 'fanout' of rt_node. For example in size class 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
+/* Common type for all nodes types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < rt_size_class_info[class].fanout)
+
+/* Base type of each node kind for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+class for variable-sized node kinds */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-125 uses the slot_idxs array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base125
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_125;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_125
+{
+ rt_node_base_125 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_125;
+
+typedef struct rt_node_leaf_125
+{
+ rt_node_base_125 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_125;
+
+/*
+ * node-256 is the largest node type. This node has an array of RT_NODE_MAX_SLOTS length
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size class */
+typedef struct rt_size_class_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_size_class_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, the we iterate nodes of each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
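+/*
+ * Typical iteration usage (sketch):
+ *
+ *     rt_iter    *iter = rt_begin_iterate(tree);
+ *     uint64      key;
+ *     uint64      value;
+ *
+ *     while (rt_iterate_next(iter, &key, &value))
+ *         ... use key and value ...
+ *     rt_end_iterate(iter);
+ */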
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
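+ /*
+ * Broadcast the search chunk into a vector and compare it against the
+ * node's chunk array loaded as two vectors.  The resulting bitfield is
+ * masked by 'count' so that any stale data beyond the last valid chunk
+ * is ignored.
+ */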
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
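+ /*
+ * To find the insertion point, take the element-wise minimum of the
+ * search chunk and the stored chunks; positions where the minimum equals
+ * the search chunk are those whose stored chunk is >= the search chunk.
+ * The lowest such position (after masking by 'count') is the insertion
+ * point, or 'count' if there is none.
+ */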
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements starting at 'idx' to the right by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
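+/*
+ * These are currently used only when growing a node-4 into a node-32, hence
+ * the hard-coded node-4 fanout below.
+ */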
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(rt_node *) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+static void
+node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+/* Return an unused slot in node-125 */
+static int
+node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
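+ /*
+ * Unlike the leaf variant, there is no isset bitmap here: a NULL child
+ * pointer marks an unused slot, so simply scan for the first one.
+ */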
+ while (node_inner_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
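+ /* Found a byte with a free bit; locate the first unused bit within it */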
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed to store the given key.
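+ * For example, with the 8-bit span used here, the shift is 0 for keys up to
+ * 0xFF, 8 for keys up to 0xFFFF, and so on.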
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key value that can be stored in a tree whose root node
+ * has the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ bool inner = shift > 0;
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = newnode;
+}
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ else
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+ * Technically the fanout is 256, but we cannot store that in a uint8,
+ * and since this is the largest size class the node will never need to grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+static inline void
+rt_copy_node(rt_node *newnode, rt_node *oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count as 'node'.
+ */
+static rt_node *
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+{
+ rt_node *newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
+ rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
+ rt_copy_node(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->base.n.shift = shift;
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree is missing the inner and leaf nodes needed for the given
+ * key-value pair. Create inner nodes from 'node' down to the bottom, then
+ * insert the value into the new leaf.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is stored in *child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_125_get_child(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is stored in *value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_125_get_value(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child into the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
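+ /*
+ * Note on the switch below: when a node becomes full, it is grown to the
+ * next larger kind and control falls through to that kind's case to
+ * perform the insertion on the new node.
+ */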
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_inner_32 *new32;
+
+ new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_inner_32;
+ }
+ else
+ {
+ rt_node_inner_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+retry_insert_inner_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_125_update(n125, chunk, child);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_inner_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_125_insert(n125, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and child were inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value into the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
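+ /* As in rt_node_insert_inner, full nodes are grown and fall through to the next case */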
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_leaf_32 *new32;
+
+ new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
+ rt_node_leaf_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
+ key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+ retry_insert_leaf_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_125_update(n125, chunk, value);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_leaf_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_125_insert(n125, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or
+ * replaced properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set 'key' to 'value'. If the entry already exists, update its value to
+ * 'value' and return true; otherwise insert a new entry and return false.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is stored in *value_p, so
+ * value_p must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend from the root to the leftmost leaf node. The key is
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * If there is a next key, set *key_p and *value_p and return true. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from the level=1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Found the next child node. Update the iterator stack from this node
+ * downwards.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance to the next used slot in the inner node. Return the child if one
+ * exists, otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_125_get_child(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance to the next used slot in the leaf node. On success, return true
+ * and store the value in *value_p; otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_125_get_value(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ n125->slot_idxs[i]));
+ else
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ n125->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_125_get_value(n125, i));
+ }
+ else
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_125_get_child(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 911a768a29..fd101e3bf4 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -22,6 +22,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..ea993e63df
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,581 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * a micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the tests, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree returned non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_iterate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /* prepare keys in interleaved order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT " after " UINT64_FORMAT " deletions",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.31.1
v15-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From caf11ea2ca608edac00443b6ab7590688385b0d4 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v15 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..4384ff591d 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..fdc504596b 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 814e0b2dba..f95b6afd86 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 60c71d05fe..8305f09f2c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3654,7 +3654,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
On Mon, Dec 19, 2022 at 2:14 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
seems that they look at only memory that are actually dsa_allocate'd.
To be exact, we estimate the number of hash buckets based on work_mem
(and hash_mem_multiplier) and use it as the upper limit. So I've
confirmed that the result of dsa_get_total_size() could exceed the
limit. I'm not sure it's a known and legitimate usage. If we can
follow such usage, we can probably track how much dsa_allocate'd
memory is used in the radix tree.
I've experimented with this idea. The newly added 0008 patch changes
the radix tree so that it counts the memory usage for both local and
shared cases. As shown below, there is an overhead for that:
w/o 0008 patch
298453544 | 282
w/ 0008 patch
293603184 | 297
This adds about as much overhead as the improvement I measured in the v4
slab allocator patch. That's not acceptable, and is exactly what Andres
warned about in
/messages/by-id/20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de
I'm guessing the hash join case can afford to be precise about memory
because it must spill to disk when exceeding workmem. We don't have that
design constraint.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Dec 20, 2022 at 3:09 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Dec 19, 2022 at 2:14 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Dec 13, 2022 at 1:04 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Looking at other code using DSA such as tidbitmap.c and nodeHash.c, it
seems that they look at only memory that are actually dsa_allocate'd.
To be exact, we estimate the number of hash buckets based on work_mem
(and hash_mem_multiplier) and use it as the upper limit. So I've
confirmed that the result of dsa_get_total_size() could exceed the
limit. I'm not sure it's a known and legitimate usage. If we can
follow such usage, we can probably track how much dsa_allocate'd
memory is used in the radix tree.
I've experimented with this idea. The newly added 0008 patch changes
the radix tree so that it counts the memory usage for both local and
shared cases. As shown below, there is an overhead for that:
w/o 0008 patch
298453544 | 282
w/ 0008 patch
293603184 | 297
This adds about as much overhead as the improvement I measured in the v4 slab allocator patch.
Oh, yes, that's bad.
/messages/by-id/20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de
I'm guessing the hash join case can afford to be precise about memory because it must spill to disk when exceeding workmem. We don't have that design constraint.
You mean that the memory used by the radix tree should be limited not
by the amount of memory actually used, but by the amount of memory
allocated? In other words, it checks by MemoryContextMemAllocated() in
the local cases and by dsa_get_total_size() in the shared case.
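To make that concrete, something like this minimal sketch is what I have
in mind (the function name and parameters are made up for illustration;
only MemoryContextMemAllocated() and dsa_get_total_size() are existing
APIs):

#include "postgres.h"
#include "utils/dsa.h"
#include "utils/memutils.h"

/*
 * Illustrative sketch only, not part of the patch set: report how much
 * memory has been allocated (not necessarily used) for the dead-TID
 * store.  "context" is the tree's local memory context and "area" is
 * its DSA area, or NULL when the tree is not shared.
 */
static Size
dead_items_mem_allocated(MemoryContext context, dsa_area *area)
{
	if (area != NULL)
		return dsa_get_total_size(area);	/* shared case */

	/* local case: include child contexts such as per-node-kind slabs */
	return MemoryContextMemAllocated(context, true);
}

Whether checking allocated rather than used memory is acceptable is
exactly the question above.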
The idea of using up to half of maintenance_work_mem might be a good
idea compared to the current flat-array solution. But since it only
uses half, I'm concerned that there will be users who double their
maintenance_work_mem. When it is improved, the user needs to restore
maintenance_work_mem again.
A better solution would be to have a slab-like DSA, where we allocate
dynamic shared memory by adding fixed-length large segments. However, a
downside is that as the segment size grows, we would need to increase
maintenance_work_mem as well. Also, this patch set is already getting
bigger and more complicated, so I don't think it's a good idea to add
more.
If we limit the memory usage by checking the amount of memory actually
used, we can use SlabStats() in the local case. Since DSA doesn't have
such functionality right now, we would need to add it, or we could
track it in the radix tree only in the shared case.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Dec 21, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Tue, Dec 20, 2022 at 3:09 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
/messages/by-id/20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de
I'm guessing the hash join case can afford to be precise about memory
because it must spill to disk when exceeding workmem. We don't have that
design constraint.
You mean that the memory used by the radix tree should be limited not
by the amount of memory actually used, but by the amount of memory
allocated? In other words, it checks by MemoryContextMemAllocated() in
the local cases and by dsa_get_total_size() in the shared case.
I mean, if this patch set uses 10x less memory than v15 (not always, but
easy to find cases where it does), and if it's also expensive to track
memory use precisely, then we don't have an incentive to track memory
precisely. Even if we did, we don't want to assume that every future caller
of radix tree is willing to incur that cost.
The idea of using up to half of maintenance_work_mem might be a good
idea compared to the current flat-array solution. But since it only
uses half, I'm concerned that there will be users who double their
maintenance_work_mem. When it is improved, the user needs to restore
maintenance_work_mem again.
I find it useful to step back and look at the usage patterns:
Autovacuum: Limiting the memory allocated by vacuum is important, since
there are multiple workers and they can run at any time (possibly most of
the time). This case will not use parallel index vacuum, so will use slab,
where the quick estimation of memory taken by the context is not terribly
far off, so we can afford to be more optimistic here.
Manual vacuum: The default configuration assumes we want to finish as soon
as possible (vacuum_cost_delay is zero). Parallel index vacuum can be used.
My experience leads me to believe users are willing to use a lot of memory
to make manual vacuum finish as quickly as possible, and are disappointed
to learn that even if maintenance work mem is 10GB, vacuum can only use 1GB.
So I don't believe anyone will have to double maintenance work mem after
upgrading (even with pessimistic accounting) because we'll be both
- much more efficient with memory on average
- free from the 1GB cap
That said, it's possible 50% is too pessimistic -- a 75% threshold will
bring us very close to powers of two for example:
2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> keep going
766 + 256 = 1022MB -> stop
I'm not sure whether that calculation could cause us to go over the
limit, or how common that would be.
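For anyone who wants to play with the numbers, here is a throwaway
standalone program (not part of the patch set) that models the growth
pattern assumed above, i.e. two segments at each size starting at 1MB
before the size doubles, stopping once the running total crosses a given
fraction of the limit:

#include <stdio.h>

/*
 * Throwaway simulation: model DSA-like segment growth where two segments
 * of each size are allocated, starting at 1MB, before the size doubles.
 * Stop once the running total crosses "threshold" of "limit_mb" and
 * return the final total in MB.
 */
static int
simulate_limit(int limit_mb, double threshold)
{
	int			total_mb = 0;
	int			segsize_mb = 1;
	int			nsame = 0;

	while (total_mb < limit_mb * threshold)
	{
		total_mb += segsize_mb;
		if (++nsame == 2)
		{
			segsize_mb *= 2;	/* assume segments at most double in size */
			nsame = 0;
		}
	}
	return total_mb;
}

int
main(void)
{
	/* 1GB limit, 75% threshold: stops at 1022MB, just under the limit */
	printf("limit 1024MB -> stops at %dMB\n", simulate_limit(1024, 0.75));
	return 0;
}

Plugging in limits that are not powers of two is an easy way to check
how far past the limit this could land.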
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Dec 22, 2022 at 7:24 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Dec 21, 2022 at 3:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Dec 20, 2022 at 3:09 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
/messages/by-id/20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de
I'm guessing the hash join case can afford to be precise about memory because it must spill to disk when exceeding workmem. We don't have that design constraint.
You mean that the memory used by the radix tree should be limited not
by the amount of memory actually used, but by the amount of memory
allocated? In other words, it checks by MemoryContextMemAllocated() in
the local cases and by dsa_get_total_size() in the shared case.
I mean, if this patch set uses 10x less memory than v15 (not always, but easy to find cases where it does), and if it's also expensive to track memory use precisely, then we don't have an incentive to track memory precisely. Even if we did, we don't want to assume that every future caller of radix tree is willing to incur that cost.
Understood.
The idea of using up to half of maintenance_work_mem might be a good
idea compared to the current flat-array solution. But since it only
uses half, I'm concerned that there will be users who double their
maintenance_work_mem. When it is improved, the user needs to restore
maintenance_work_mem again.
I find it useful to step back and look at the usage patterns:
Autovacuum: Limiting the memory allocated by vacuum is important, since there are multiple workers and they can run at any time (possibly most of the time). This case will not use parallel index vacuum, so will use slab, where the quick estimation of memory taken by the context is not terribly far off, so we can afford to be more optimistic here.
Manual vacuum: The default configuration assumes we want to finish as soon as possible (vacuum_cost_delay is zero). Parallel index vacuum can be used. My experience leads me to believe users are willing to use a lot of memory to make manual vacuum finish as quickly as possible, and are disappointed to learn that even if maintenance work mem is 10GB, vacuum can only use 1GB.
Agreed.
So I don't believe anyone will have to double maintenance work mem after upgrading (even with pessimistic accounting) because we'll be both
- much more efficient with memory on average
- free from the 1GB cap
Make sense.
That said, it's possible 50% is too pessimistic -- a 75% threshold will bring us very close to powers of two for example:
2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> keep going
766 + 256 = 1022MB -> stop
I'm not sure whether that calculation could cause us to go over the limit, or how common that would be.
If the value is a power of 2, it seems to work perfectly fine. But for
example if it's 700MB, the total memory exceeds the limit:
2*(1+2+4+8+16+32+64+128) = 510MB (72.8% of 700MB) -> keep going
510 + 256 = 766MB -> stop, but it exceeds the limit.
In a bigger case, if it's 11000MB,
2*(1+2+...+2048) = 8190MB (74.4%)
8190 + 4096 = 12286MB
That being said, I don't think these are common cases, so the 75%
threshold seems to work fine in most cases.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Dec 22, 2022 at 10:00 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
If the value is a power of 2, it seems to work perfectly fine. But for
example if it's 700MB, the total memory exceeds the limit:
2*(1+2+4+8+16+32+64+128) = 510MB (72.8% of 700MB) -> keep going
510 + 256 = 766MB -> stop, but it exceeds the limit.
In a bigger case, if it's 11000MB,
2*(1+2+...+2048) = 8190MB (74.4%)
8190 + 4096 = 12286MB
That being said, I don't think these are common cases, so the 75%
threshold seems to work fine in most cases.
Thinking some more, I agree this doesn't have large practical risk, but
thinking from the point of view of the community, being loose with memory
limits by up to 10% is not a good precedent.
Perhaps we can be clever and use 75% when the limit is a power of two and
50% otherwise. I'm skeptical of trying to be clever, and I just thought of
an additional concern: we're assuming how new DSA segments grow in size,
which could possibly change. Given how allocators are
typically coded, though, it seems safe to assume that they'll at most
double in size.
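For completeness, the "clever" rule would amount to something like this
(purely illustrative, not proposing actual code; it's just the usual
power-of-two bit test):

#include <stdint.h>

/*
 * Purely illustrative: choose a more aggressive budget threshold when
 * the limit (in bytes) is a power of two, since doubling segment sizes
 * then line up with it exactly; otherwise fall back to 50%.
 */
static inline double
budget_threshold(uint64_t limit)
{
	return (limit != 0 && (limit & (limit - 1)) == 0) ? 0.75 : 0.50;
}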
--
John Naylor
EDB: http://www.enterprisedb.com
I wrote:
- Try templating out the differences between local and shared memory.
Here is a brief progress report before Christmas vacation.
I thought the best way to approach this was to go "inside out", that is,
start with the modest goal of reducing duplicated code for v16.
0001-0005 are copies from v13.
0006 whacks around the rt_node_insert_inner function to reduce the "surface
area" as far as symbols and casts. This includes replacing the goto with an
extra "unlikely" branch.
0007 removes the STRICT pragma for one of our benchmark functions that
crept in somewhere -- it should use the default and not just return NULL
instantly.
0008 further whacks around the node-growing code in rt_node_insert_inner to
remove casts. When growing the size class within the same kind, we have no
need for a "new32" (etc) variable. Also, to keep from getting confused
about what an assert build verifies at the end, add a "newnode" variable
and assign it to "node" as soon as possible.
0009 uses the bitmap logic from 0004 for node256 also. There is no
performance reason for this, because there is no iteration needed, but it's
good for simplicity and consistency.
0010 and 0011 template a common implementation for both leaf and inner
nodes for searching and inserting.
0012: While at it, I couldn't resist using this technique to separate out
delete from search, which makes sense and might give a small performance
boost (at least on less capable hardware). I haven't got to the iteration
functions, but they should be straightforward.
There is more that could be done here, but I didn't want to get too ahead
of myself. For example, it's possible that struct members "children" and
"values" are names that don't need to be distinguished. Making them the
same would reduce code like
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
...but there could be downsides and I don't want to distract from the goal
of dealing with shared memory.
The tests pass, but it's not impossible that there is a new bug somewhere.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v16-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 9661e7c32198fb77f3218cac7c444490d92f380f Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v16 02/12] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..4384ff591d 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..fdc504596b 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 814e0b2dba..f95b6afd86 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 60c71d05fe..8305f09f2c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3654,7 +3654,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.38.1
v16-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From f817851b80e4ec3fef4e5d9f32cc505c4d7f13f7 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v16 01/12] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 61ae4ecf60..0b288c422a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.38.1
v16-0003-Add-radix-implementation.patch
From 21751137cc807d4a9473f74ea287c8191dea5093 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v16 03/12] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2541 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 581 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3291 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 0edddffacf..8193da105a 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -11,4 +11,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..e7f61fd943
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2541 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we do not expect the tree to
+ * become very deep.
+ *
+ * Both the key and the value are 64-bit unsigned integer. The inner nodes and
+ * the leaf nodes have slightly different structure: for inner tree nodes,
+ * shift > 0, store the pointer to its child node as the value. The leaf nodes,
+ * shift == 0, have the 64-bit unsigned integer that is specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants, one for inner nodes and
+ * one for leaf nodes, so there is some code duplication. While this sometimes
+ * makes code maintenance tricky, it reduces branch prediction misses when
+ * deciding whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iterate - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * and memory contexts for all kinds of radix tree node under the memory context.
+ *
+ * rt_iterate_next() returns key-value pairs in ascending order of the
+ * key.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/*
+ * Return the number of bytes of bitmap space needed to cover nslots slots,
+ * used by nodes whose slots are indexed by array lookup.
+ */
+#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/*
+ * Mapping from the value to the bit in is-set bitmap in the node-256.
+ */
+#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
+#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds, and each node kind has one or two size classes,
+ * partial and full. Size classes within the same node kind share the same
+ * node structure but have a different fanout, which is stored in 'fanout'
+ * of rt_node. For example in size class 15, when a 16th element is to be
+ * inserted, we allocate a larger area and memcpy the entire old node to
+ * it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes when allocated on
+ * DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
+/* Common type for all nodes types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < rt_size_class_info[class].fanout)
+
+/* Base type of each node kinds for leaf and inner nodes */
+/* The base types must be a be able to accommodate the largest size
+class for variable-sized node kinds*/
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base125
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+} rt_node_base_125;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * They are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_125
+{
+ rt_node_base_125 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_125;
+
+typedef struct rt_node_leaf_125
+{
+ rt_node_base_125 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(128)];
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_125;
+
+/*
+ * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size class */
+typedef struct rt_size_class_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_size_class_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
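+
+/*
+ * For example, assuming SLAB_DEFAULT_BLOCK_SIZE is 8kB: a 40-byte chunk gives
+ * Max((8192 / 40) * 40, 40 * 32) = Max(8160, 1280) = 8160, i.e. the largest
+ * multiple of the chunk size that fits in the default block, while a
+ * hypothetical 2kB chunk gives Max(8192, 65536) = 64kB so that at least 32
+ * chunks still fit in one block.
+ */
+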
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
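+
+/*
+ * For example, a new node of kind RT_NODE_KIND_32 starts out in the
+ * RT_CLASS_32_PARTIAL size class (fanout 15); when that fills up it is
+ * reallocated as RT_CLASS_32_FULL (fanout 32) while keeping the same kind,
+ * and only when the full class is also exhausted does it grow to the next
+ * kind (node-125).
+ */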
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
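+
+/*
+ * Typical iteration usage (a minimal sketch):
+ *
+ *   rt_iter *iter = rt_begin_iterate(tree);
+ *   uint64   key;
+ *   uint64   value;
+ *
+ *   while (rt_iterate_next(iter, &key, &value))
+ *       ... process the pair; keys are returned in ascending order ...
+ *
+ *   rt_end_iterate(iter);
+ */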
+
+/* The radix tree itself */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunks array
+ * of the given node.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunks array
+ * of the given node.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
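+ /*
+ * The SIMD path below finds the first position whose chunk is >= 'chunk':
+ * taking the element-wise minimum with the broadcast search value, the
+ * positions where the minimum equals the search value are exactly those
+ * where chunks[i] >= chunk. In assertion-enabled builds the scalar loop
+ * above double-checks the result.
+ */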
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements at and after 'idx' one position to the right */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(rt_node *) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value (or child)? */
+static inline bool
+node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+/* Is the slot in the node used? */
+static inline bool
+node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return (node->children[slot] != NULL);
+}
+
+static inline bool
+node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(slot < node->base.n.fanout);
+ return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+}
+
+static inline rt_node *
+node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+static void
+node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+/* Return an unused slot in node-125 */
+static int
+node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
+{
+ int slotpos = 0;
+
+ Assert(!NODE_IS_LEAF(node));
+ while (node_inner_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static int
+node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* We iterate over the isset bitmap per byte then check each bit */
+ for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ {
+ if (node->isset[slotpos] < 0xFF)
+ break;
+ }
+ Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
+
+ slotpos *= BITS_PER_BYTE;
+ while (node_leaf_125_is_slot_used(node, slotpos))
+ slotpos++;
+
+ return slotpos;
+}
+
+static inline void
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_inner_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ /* find unused slot */
+ slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+}
+
+/*
+ * Return the shift needed for a node to be able to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
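+
+/*
+ * For example, assuming RT_NODE_SPAN is 8: keys up to 0xFF yield shift 0,
+ * key 0x1234 (leftmost one bit at position 12) yields (12 / 8) * 8 = 8, and
+ * key 0x100000000 yields 32.
+ */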
+
+/*
+ * Return the maximum value that can be stored in a tree whose root node has
+ * the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
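+
+/*
+ * For example, assuming RT_NODE_SPAN is 8: shift 0 gives 0xFF, shift 8 gives
+ * 0xFFFF, and RT_MAX_SHIFT gives UINT64_MAX.
+ */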
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ bool inner = shift > 0;
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = newnode;
+}
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ else
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+ * Technically it's 256, but we cannot store that in a uint8,
+ * and this is the max size class so it will never grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+static inline void
+rt_copy_node(rt_node *newnode, rt_node *oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count as 'node'.
+ */
+static rt_node*
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+{
+ rt_node *newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
+ rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
+ rt_copy_node(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it
+ * can store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->base.n.shift = shift;
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have the inner and leaf nodes needed for the given
+ * key-value pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_125_get_child(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * to the value is set to value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_125_get_value(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(!NODE_IS_LEAF(node));
+
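+ /*
+ * Note on control flow: when a node is full, it is grown either to a
+ * larger size class of the same kind (retrying via goto) or to the next
+ * node kind, in which case we fall through to the next case to redo the
+ * insertion on the new, larger node.
+ */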
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
+ key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_inner_32 *new32;
+
+ new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_inner_32;
+ }
+ else
+ {
+ rt_node_inner_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+retry_insert_inner_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int16 count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_125_update(n125, chunk, child);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_inner_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_inner_125_insert(n125, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_leaf_32 *new32;
+
+ new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
+ rt_node_leaf_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
+ key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+ retry_insert_leaf_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_125_update(n125, chunk, value);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_leaf_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_125_insert(n125, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true. Return false if the entry didn't exist yet.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set to *value_p, which
+ * therefore must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is constructed
+ * while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key.
+ * Otherwise, return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from level 1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Found the next child node. Set it in the node iterator and update the
+ * iterator stack from this node down to the leaf.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_125_get_child(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to *value_p, otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_125_get_value(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the invariants of the given radix tree node (assertion builds only).
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ if (NODE_IS_LEAF(node))
+ Assert(node_leaf_125_is_slot_used((rt_node_leaf_125 *) node,
+ n125->slot_idxs[i]));
+ else
+ Assert(node_inner_125_is_slot_used((rt_node_inner_125 *) node,
+ n125->slot_idxs[i]));
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
+ cnt += pg_popcount32(n256->isset[i]);
+
+ /* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < 16; i++)
+ {
+ fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_125_get_value(n125, i));
+ }
+ else
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_125_get_child(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached at a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
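+/*
+ * Basic usage (a minimal sketch):
+ *
+ *   radix_tree *tree = rt_create(CurrentMemoryContext);
+ *   uint64      value;
+ *
+ *   rt_set(tree, key, val);
+ *   if (rt_search(tree, key, &value))
+ *       ... use value ...
+ *   rt_free(tree);
+ */
+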
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index eefc0b2063..2458ca64cc 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..ea993e63df
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,581 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /* prepare keys in order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.38.1
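
(A side note for readers skimming the test module above: test_node_types() exercises every node kind at each level of the tree by shifting an 8-bit chunk into successive byte positions of the 64-bit key. Below is a minimal standalone sketch, not part of the patch, of the key decomposition it relies on; RT_NODE_SPAN, RT_CHUNK_MASK and RT_GET_KEY_CHUNK mirror the definitions in radixtree.c.)

#include <stdint.h>
#include <stdio.h>

/* Mirrors radixtree.c: each tree level consumes 8 bits of the 64-bit key. */
#define RT_NODE_SPAN	8
#define RT_CHUNK_MASK	((1 << RT_NODE_SPAN) - 1)
#define RT_GET_KEY_CHUNK(key, shift) ((uint8_t) (((key) >> (shift)) & RT_CHUNK_MASK))

int
main(void)
{
	uint64_t	key = UINT64_C(0x0123456789ABCDEF);

	/* Walk from the most significant chunk down to shift 0, as a lookup does. */
	for (int shift = 64 - RT_NODE_SPAN; shift >= 0; shift -= RT_NODE_SPAN)
		printf("shift %2d -> chunk 0x%02X\n", shift, RT_GET_KEY_CHUNK(key, shift));

	return 0;
}

Compiled on its own, this prints one chunk per level (0x01 down to 0xEF), which is the sequence of slots a lookup follows from the root down to a leaf.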
v16-0004-Use-bitmapword-for-node-125.patch
From deab00e6a99e42a8a96ac808dc0858d452bfd0e5 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 15:22:26 +0700
Subject: [PATCH v16 04/12] Use bitmapword for node-125
---
src/backend/lib/radixtree.c | 70 ++++++++++++++++++-------------------
1 file changed, 34 insertions(+), 36 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index e7f61fd943..abd0450727 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -62,6 +62,7 @@
#include "lib/radixtree.h"
#include "lib/stringinfo.h"
#include "miscadmin.h"
+#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
#include "port/pg_lfind.h"
#include "utils/memutils.h"
@@ -103,6 +104,10 @@
#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
+/* FIXME rename */
+#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
+
/* Enum used rt_node_search() */
typedef enum
{
@@ -207,6 +212,9 @@ typedef struct rt_node_base125
/* The index of slots for each fanout */
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[WORDNUM(128)];
} rt_node_base_125;
typedef struct rt_node_base256
@@ -271,9 +279,6 @@ typedef struct rt_node_leaf_125
{
rt_node_base_125 base;
- /* isset is a bitmap to track which slot is in use */
- uint8 isset[RT_NODE_NSLOTS_BITS(128)];
-
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
} rt_node_leaf_125;
@@ -655,13 +660,14 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
}
+#ifdef USE_ASSERT_CHECKING
/* Is the slot in the node used? */
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
Assert(!NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return (node->children[slot] != NULL);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
static inline bool
@@ -669,8 +675,9 @@ node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
Assert(NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return ((node->isset[RT_NODE_BITMAP_BYTE(slot)] & RT_NODE_BITMAP_BIT(slot)) != 0);
+ return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
}
+#endif
static inline rt_node *
node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
@@ -690,7 +697,10 @@ node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
static void
node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
+ int slotpos = node->base.slot_idxs[chunk];
+
Assert(!NODE_IS_LEAF(node));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->children[node->base.slot_idxs[chunk]] = NULL;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -701,44 +711,35 @@ node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
int slotpos = node->base.slot_idxs[chunk];
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] &= ~(RT_NODE_BITMAP_BIT(slotpos));
+ node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
/* Return an unused slot in node-125 */
static int
-node_inner_125_find_unused_slot(rt_node_inner_125 *node, uint8 chunk)
-{
- int slotpos = 0;
-
- Assert(!NODE_IS_LEAF(node));
- while (node_inner_125_is_slot_used(node, slotpos))
- slotpos++;
-
- return slotpos;
-}
-
-static int
-node_leaf_125_find_unused_slot(rt_node_leaf_125 *node, uint8 chunk)
+node_125_find_unused_slot(bitmapword *isset)
{
int slotpos;
+ int idx;
+ bitmapword inverse;
- Assert(NODE_IS_LEAF(node));
-
- /* We iterate over the isset bitmap per byte then check each bit */
- for (slotpos = 0; slotpos < RT_NODE_NSLOTS_BITS(128); slotpos++)
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < WORDNUM(128); idx++)
{
- if (node->isset[slotpos] < 0xFF)
+ if (isset[idx] < ~((bitmapword) 0))
break;
}
- Assert(slotpos < RT_NODE_NSLOTS_BITS(128));
- slotpos *= BITS_PER_BYTE;
- while (node_leaf_125_is_slot_used(node, slotpos))
- slotpos++;
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+
+ /* mark the slot used */
+ isset[idx] |= bmw_rightmost_one(inverse);
return slotpos;
-}
+ }
static inline void
node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
@@ -747,8 +748,7 @@ node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
Assert(!NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_inner_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
@@ -763,12 +763,10 @@ node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
Assert(NODE_IS_LEAF(node));
- /* find unused slot */
- slotpos = node_leaf_125_find_unused_slot(node, chunk);
+ slotpos = node_125_find_unused_slot(node->base.isset);
Assert(slotpos < node->base.n.fanout);
node->base.slot_idxs[chunk] = slotpos;
- node->isset[RT_NODE_BITMAP_BYTE(slotpos)] |= RT_NODE_BITMAP_BIT(slotpos);
node->values[slotpos] = value;
}
@@ -2395,9 +2393,9 @@ rt_dump_node(rt_node *node, int level, bool recurse)
rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < 16; i++)
+ for (int i = 0; i < WORDNUM(128); i++)
{
- fprintf(stderr, "%X ", (uint8) n->isset[i]);
+ fprintf(stderr, UINT64_FORMAT_HEX " ", n->base.isset[i]);
}
fprintf(stderr, "\n");
}
--
2.38.1
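
(A note on the slot allocation in the 0004 patch above: node_125_find_unused_slot() scans the isset bitmap one bitmapword at a time and, in the first word that is not full, picks the lowest clear bit by inverting the word and isolating its lowest set bit. The following is a standalone sketch of the same trick in portable C; it is illustrative only, using plain 64-bit words and a loop where the patch uses bitmapword and the pg_bitutils helpers.)

#include <stdint.h>
#include <stdio.h>

/* Find and claim the lowest clear bit across an array of 64-bit words. */
static int
find_unused_slot(uint64_t *isset, int nwords)
{
	for (int idx = 0; idx < nwords; idx++)
	{
		uint64_t	inverse;
		uint64_t	lowest;
		int			bit = 0;

		if (isset[idx] == UINT64_MAX)
			continue;			/* this word is full */

		/* The lowest clear bit of X is the lowest set bit of ~X. */
		inverse = ~isset[idx];
		lowest = inverse & (~inverse + 1);	/* isolate the lowest set bit */

		while (((lowest >> bit) & 1) == 0)
			bit++;

		isset[idx] |= lowest;	/* mark the slot used */
		return idx * 64 + bit;
	}

	return -1;					/* no free slot */
}

int
main(void)
{
	uint64_t	isset[2] = {UINT64_MAX, 0x07};

	/* word 0 is full; the lowest clear bit in word 1 is bit 3, so slot 67 */
	printf("next free slot: %d\n", find_unused_slot(isset, 2));
	return 0;
}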
v16-0005-tool-for-measuring-radix-tree-performance.patch
From 24859a28b554695d3c5f5e4b41b65375f666c765 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v16 05/12] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 +++
contrib/bench_radix_tree/bench_radix_tree.c | 635 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 767 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..83529805fc
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..a0693695e6
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,635 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.38.1
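
(For anyone reproducing the benchmark above: tid_to_key_off() in bench_radix_tree.c packs the offset number into the low bits of a 64-bit integer, below the block number, and then splits that integer into a radix tree key (the upper bits) and a bit position within the 64-bit value (the low 6 bits), so up to 64 TIDs with nearby offsets share a single tree entry. A rough standalone sketch follows; the 9 offset bits are an assumption matching ceil(log2(MaxHeapTuplesPerPage)) for the default 8kB block size, whereas the real code derives the shift at runtime.)

#include <stdint.h>
#include <stdio.h>

/*
 * Sketch of the benchmark's TID encoding: offset number in the low bits,
 * block number above it; the low 6 bits of the packed value select a bit
 * within the 64-bit value, the rest becomes the radix tree key.
 */
static uint64_t
encode_tid(uint32_t block, uint16_t offset, uint32_t *bit)
{
	const uint32_t offset_bits = 9;		/* assumption: 8kB heap pages */
	uint64_t	packed;

	packed = ((uint64_t) block << offset_bits) | offset;
	*bit = (uint32_t) (packed & ((1 << 6) - 1));	/* bit within the value */
	return packed >> 6;								/* radix tree key */
}

int
main(void)
{
	uint32_t	bit;
	uint64_t	key = encode_tid(1000, 7, &bit);

	printf("key = %llu, bit = %u\n", (unsigned long long) key, bit);
	return 0;
}

Loading then works as in bench_search(): while consecutive TIDs map to the same key, OR their bits into one value and call rt_set() once per key.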
v16-0009-Use-bitmap-operations-for-isset-arrays-rather-th.patch
From e1085a0420f719f7e4ce8a904794ab8e484b75a9 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 19 Dec 2022 16:16:12 +0700
Subject: [PATCH v16 09/12] Use bitmap operations for isset arrays rather than
byte operations
It's simpler to do the same thing everywhere, even for node256
where iteration performance doesn't matter as much because we
always can insert directly.
Also rename WORDNUM and BITNUM to avoid clashing with bitmapset.c.
---
src/backend/lib/radixtree.c | 64 +++++++++++++++++++++----------------
1 file changed, 36 insertions(+), 28 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index ddf7b002fc..7899e844fb 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -77,12 +77,6 @@
/* The number of maximum slots in the node */
#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
-/*
- * Return the number of bits required to represent nslots slots, used
- * nodes indexed by array lookup.
- */
-#define RT_NODE_NSLOTS_BITS(nslots) ((nslots) / (sizeof(uint8) * BITS_PER_BYTE))
-
/* Mask for extracting a chunk from the key */
#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
@@ -98,15 +92,9 @@
/* Get a chunk from the key */
#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
-/*
- * Mapping from the value to the bit in is-set bitmap in the node-256.
- */
-#define RT_NODE_BITMAP_BYTE(v) ((v) / BITS_PER_BYTE)
-#define RT_NODE_BITMAP_BIT(v) (UINT64CONST(1) << ((v) % RT_NODE_SPAN))
-
-/* FIXME rename */
-#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
/* Enum used rt_node_search() */
typedef enum
@@ -214,7 +202,7 @@ typedef struct rt_node_base125
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
/* isset is a bitmap to track which slot is in use */
- bitmapword isset[WORDNUM(128)];
+ bitmapword isset[BM_IDX(128)];
} rt_node_base_125;
typedef struct rt_node_base256
@@ -300,7 +288,7 @@ typedef struct rt_node_leaf_256
rt_node_base_256 base;
/* isset is a bitmap to track which slot is in use */
- uint8 isset[RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS)];
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
/* Slots for 256 values */
uint64 values[RT_NODE_MAX_SLOTS];
@@ -665,17 +653,23 @@ node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
static inline bool
node_inner_125_is_slot_used(rt_node_inner_125 *node, uint8 slot)
{
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
Assert(!NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
+ return (node->base.isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
}
static inline bool
node_leaf_125_is_slot_used(rt_node_leaf_125 *node, uint8 slot)
{
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
Assert(NODE_IS_LEAF(node));
Assert(slot < node->base.n.fanout);
- return (node->base.isset[WORDNUM(slot)] & ((bitmapword) 1 << BITNUM(slot))) != 0;
+ return (node->base.isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
}
#endif
@@ -698,9 +692,12 @@ static void
node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
+ int idx = BM_IDX(slotpos);
+ int bitnum = BM_BIT(slotpos);
Assert(!NODE_IS_LEAF(node));
- node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
+
+ node->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
node->children[node->base.slot_idxs[chunk]] = NULL;
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -709,9 +706,11 @@ static void
node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
{
int slotpos = node->base.slot_idxs[chunk];
+ int idx = BM_IDX(slotpos);
+ int bitnum = BM_BIT(slotpos);
Assert(NODE_IS_LEAF(node));
- node->base.isset[WORDNUM(slotpos)] &= ~((bitmapword) 1 << BITNUM(slotpos));
+ node->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
}
@@ -724,7 +723,7 @@ node_125_find_unused_slot(bitmapword *isset)
bitmapword inverse;
/* get the first word with at least one bit not set */
- for (idx = 0; idx < WORDNUM(128); idx++)
+ for (idx = 0; idx < BM_IDX(128); idx++)
{
if (isset[idx] < ~((bitmapword) 0))
break;
@@ -798,8 +797,11 @@ node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
static inline bool
node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
Assert(NODE_IS_LEAF(node));
- return (node->isset[RT_NODE_BITMAP_BYTE(chunk)] & RT_NODE_BITMAP_BIT(chunk)) != 0;
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
}
static inline rt_node *
@@ -830,8 +832,11 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
static inline void
node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(chunk)] |= RT_NODE_BITMAP_BIT(chunk);
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
node->values[chunk] = value;
}
@@ -846,8 +851,11 @@ node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
static inline void
node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
Assert(NODE_IS_LEAF(node));
- node->isset[RT_NODE_BITMAP_BYTE(chunk)] &= ~(RT_NODE_BITMAP_BIT(chunk));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
}
/*
@@ -2269,8 +2277,8 @@ rt_verify_node(rt_node *node)
rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
int cnt = 0;
- for (int i = 0; i < RT_NODE_NSLOTS_BITS(RT_NODE_MAX_SLOTS); i++)
- cnt += pg_popcount32(n256->isset[i]);
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
/* Check if the number of used chunk matches */
Assert(n256->base.n.count == cnt);
@@ -2386,7 +2394,7 @@ rt_dump_node(rt_node *node, int level, bool recurse)
rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < WORDNUM(128); i++)
+ for (int i = 0; i < BM_IDX(128); i++)
{
fprintf(stderr, UINT64_FORMAT_HEX " ", n->base.isset[i]);
}
--
2.38.1
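
(To make the renamed BM_IDX/BM_BIT macros in the 0009 patch above concrete: BM_IDX selects the bitmapword and BM_BIT the bit within it, and counting used slots then reduces to a per-word popcount, which is what rt_verify_node() does for the node-256 leaf. A tiny standalone illustration in plain C; 64-bit words are assumed here, whereas bitmapword may be 32 bits on some platforms and the real code uses bmw_popcount().)

#include <stdint.h>
#include <stdio.h>

#define BITS_PER_WORD	64
#define BM_IDX(x)		((x) / BITS_PER_WORD)
#define BM_BIT(x)		((x) % BITS_PER_WORD)

static uint64_t isset[256 / BITS_PER_WORD];		/* 4 words cover 256 slots */

static void
set_slot(int slot)
{
	isset[BM_IDX(slot)] |= (uint64_t) 1 << BM_BIT(slot);
}

static int
count_slots(void)
{
	int			cnt = 0;

	for (int i = 0; i < 256 / BITS_PER_WORD; i++)
		for (uint64_t w = isset[i]; w != 0; w &= w - 1)	/* clear lowest set bit */
			cnt++;

	return cnt;
}

int
main(void)
{
	set_slot(0);
	set_slot(63);
	set_slot(64);
	set_slot(255);
	printf("used slots: %d\n", count_slots());	/* prints 4 */
	return 0;
}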
v16-0008-Use-newnode-variable-to-reduce-unnecessary-casti.patch
From 7c2652509034b569eb4fc49faaf4dd7a61bfa8fd Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 19 Dec 2022 15:08:15 +0700
Subject: [PATCH v16 08/12] Use newnode variable to reduce unnecessary casting
---
src/backend/lib/radixtree.c | 46 +++++++++++++++++--------------------
1 file changed, 21 insertions(+), 25 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 7c993e096b..ddf7b002fc 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -1284,6 +1284,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
{
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool chunk_exists = false;
+ rt_node *newnode = NULL;
Assert(!NODE_IS_LEAF(node));
@@ -1306,18 +1307,16 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
rt_node_inner_32 *new32;
- Assert(parent != NULL);
/* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) newnode;
chunk_children_array_copy(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32,
- key);
- node = (rt_node *) new32;
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
}
else
{
@@ -1354,19 +1353,17 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
n32->base.n.count == minclass.fanout)
{
- /* use the same node kind, but expand to the next size class */
- rt_node_inner_32 *new32;
-
- new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
- memcpy(new32, n32, minclass.inner_size);
- new32->base.n.fanout = maxclass.fanout;
+ /* grow to the next size class of this kind */
+ newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(newnode, node, minclass.inner_size);
+ newnode->fanout = maxclass.fanout;
Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
- /* must update both pointers here */
- node = (rt_node *) new32;
- n32 = new32;
+ /* also update pointer for this kind */
+ n32 = (rt_node_inner_32 *) newnode;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
@@ -1374,14 +1371,14 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
rt_node_inner_125 *new125;
/* grow node from 32 to 125 */
- new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_inner_125 *) newnode;
for (int i = 0; i < n32->base.n.count; i++)
node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
Assert(parent != NULL);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
- node = (rt_node *) new125;
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
}
else
{
@@ -1420,8 +1417,8 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
Assert(parent != NULL);
/* grow node from 125 to 256 */
- new256 = (rt_node_inner_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) newnode;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!node_125_is_chunk_used(&n125->base, i))
@@ -1431,9 +1428,8 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
cnt++;
}
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
}
else
{
--
2.38.1
Attachment: v16-0010-Template-out-node-insert-functions.patch (text/x-patch)
From 48892a7f66892aeb3346622fd7b26e20811154d8 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 23 Dec 2022 14:33:49 +0700
Subject: [PATCH v16 10/12] Template out node insert functions
---
src/backend/lib/radixtree.c | 369 +-----------------------
src/include/lib/radixtree_insert_impl.h | 257 +++++++++++++++++
2 files changed, 263 insertions(+), 363 deletions(-)
create mode 100644 src/include/lib/radixtree_insert_impl.h
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 7899e844fb..79d12b27d2 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -1290,185 +1290,9 @@ static bool
rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
rt_node *child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool chunk_exists = false;
- rt_node *newnode = NULL;
-
- Assert(!NODE_IS_LEAF(node));
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
- int idx;
-
- idx = node_4_search_eq(&n4->base, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n4->children[idx] = child;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
- {
- rt_node_inner_32 *new32;
-
- /* grow node from 4 to 32 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
- new32 = (rt_node_inner_32 *) newnode;
- chunk_children_array_copy(n4->base.chunks, n4->children,
- new32->base.chunks, new32->children);
-
- Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
- }
- else
- {
- int insertpos = node_4_get_insertpos(&n4->base, chunk);
- uint16 count = n4->base.n.count;
-
- /* shift chunks and children */
- if (count != 0 && insertpos < count)
- chunk_children_array_shift(n4->base.chunks, n4->children,
- count, insertpos);
-
- n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_32:
- {
- const rt_size_class_elem minclass = rt_size_class_info[RT_CLASS_32_PARTIAL];
- const rt_size_class_elem maxclass = rt_size_class_info[RT_CLASS_32_FULL];
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
- int idx;
-
- idx = node_32_search_eq(&n32->base, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n32->children[idx] = child;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
- n32->base.n.count == minclass.fanout)
- {
- /* grow to the next size class of this kind */
- newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
- memcpy(newnode, node, minclass.inner_size);
- newnode->fanout = maxclass.fanout;
-
- Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
-
- /* also update pointer for this kind */
- n32 = (rt_node_inner_32 *) newnode;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
- {
- rt_node_inner_125 *new125;
-
- /* grow node from 32 to 125 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
- new125 = (rt_node_inner_125 *) newnode;
- for (int i = 0; i < n32->base.n.count; i++)
- node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
-
- Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
- }
- else
- {
- int insertpos = node_32_get_insertpos(&n32->base, chunk);
- int16 count = n32->base.n.count;
-
- if (insertpos < count)
- {
- Assert(count > 0);
- chunk_children_array_shift(n32->base.chunks, n32->children,
- count, insertpos);
- }
-
- n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_125:
- {
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
- int cnt = 0;
-
- if (node_125_is_chunk_used(&n125->base, chunk))
- {
- /* found the existing chunk */
- chunk_exists = true;
- node_inner_125_update(n125, chunk, child);
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
- {
- rt_node_inner_256 *new256;
- Assert(parent != NULL);
-
- /* grow node from 125 to 256 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
- new256 = (rt_node_inner_256 *) newnode;
- for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
- {
- if (!node_125_is_chunk_used(&n125->base, i))
- continue;
-
- node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
- cnt++;
- }
-
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
- }
- else
- {
- node_inner_125_insert(n125, chunk, child);
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_256:
- {
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
-
- chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
-
- node_inner_256_set(n256, chunk, child);
- break;
- }
- }
-
- /* Update statistics */
- if (!chunk_exists)
- node->count++;
-
- /*
- * Done. Finally, verify the chunk and value is inserted or replaced
- * properly in the node.
- */
- rt_verify_node(node);
-
- return chunk_exists;
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
}
/* Insert the value to the leaf node */
@@ -1476,190 +1300,9 @@ static bool
rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool chunk_exists = false;
-
- Assert(NODE_IS_LEAF(node));
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
- int idx;
-
- idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n4->values[idx] = value;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
- {
- rt_node_leaf_32 *new32;
- Assert(parent != NULL);
-
- /* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
- chunk_values_array_copy(n4->base.chunks, n4->values,
- new32->base.chunks, new32->values);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
- node = (rt_node *) new32;
- }
- else
- {
- int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
- int count = n4->base.n.count;
-
- /* shift chunks and values */
- if (count != 0 && insertpos < count)
- chunk_values_array_shift(n4->base.chunks, n4->values,
- count, insertpos);
-
- n4->base.chunks[insertpos] = chunk;
- n4->values[insertpos] = value;
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_32:
- {
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
- int idx;
-
- idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n32->values[idx] = value;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
- {
- Assert(parent != NULL);
-
- if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
- {
- /* use the same node kind, but expand to the next size class */
- const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
- const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
- rt_node_leaf_32 *new32;
-
- new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
- memcpy(new32, n32, size);
- new32->base.n.fanout = fanout;
-
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
-
- /* must update both pointers here */
- node = (rt_node *) new32;
- n32 = new32;
-
- goto retry_insert_leaf_32;
- }
- else
- {
- rt_node_leaf_125 *new125;
-
- /* grow node from 32 to 125 */
- new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
- for (int i = 0; i < n32->base.n.count; i++)
- node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
-
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
- key);
- node = (rt_node *) new125;
- }
- }
- else
- {
- retry_insert_leaf_32:
- {
- int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
- int count = n32->base.n.count;
-
- if (count != 0 && insertpos < count)
- chunk_values_array_shift(n32->base.chunks, n32->values,
- count, insertpos);
-
- n32->base.chunks[insertpos] = chunk;
- n32->values[insertpos] = value;
- break;
- }
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_125:
- {
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
- int cnt = 0;
-
- if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
- {
- /* found the existing chunk */
- chunk_exists = true;
- node_leaf_125_update(n125, chunk, value);
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
- {
- rt_node_leaf_256 *new256;
- Assert(parent != NULL);
-
- /* grow node from 125 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
- for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
- {
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
- continue;
-
- node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
- cnt++;
- }
-
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
- }
- else
- {
- node_leaf_125_insert(n125, chunk, value);
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_256:
- {
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
-
- chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
-
- node_leaf_256_set(n256, chunk, value);
- break;
- }
- }
-
- /* Update statistics */
- if (!chunk_exists)
- node->count++;
-
- /*
- * Done. Finally, verify the chunk and value is inserted or replaced
- * properly in the node.
- */
- rt_verify_node(node);
-
- return chunk_exists;
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
}
/*
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..8e02c83fc7
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,257 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE rt_node_inner_4
+#define RT_NODE32_TYPE rt_node_inner_32
+#define RT_NODE125_TYPE rt_node_inner_125
+#define RT_NODE256_TYPE rt_node_inner_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE rt_node_leaf_4
+#define RT_NODE32_TYPE rt_node_leaf_32
+#define RT_NODE125_TYPE rt_node_leaf_125
+#define RT_NODE256_TYPE rt_node_leaf_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ rt_node *newnode = NULL;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx;
+
+ idx = node_4_search_eq(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[idx] = value;
+#else
+ n4->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ RT_NODE32_TYPE *new32;
+
+ /* grow node from 4 to 32 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (RT_NODE32_TYPE *) newnode;
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+#else
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+#endif
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos(&n4->base, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+#else
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+#endif
+ }
+
+ n4->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[insertpos] = value;
+#else
+ n4->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const rt_size_class_elem minclass = rt_size_class_info[RT_CLASS_32_PARTIAL];
+ const rt_size_class_elem maxclass = rt_size_class_info[RT_CLASS_32_FULL];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = node_32_search_eq(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.count == minclass.fanout)
+ {
+ /* grow to the next size class of this kind */
+ newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(newnode, node, minclass.inner_size);
+ newnode->fanout = maxclass.fanout;
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (RT_NODE32_TYPE *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ RT_NODE125_TYPE *new125;
+
+ /* grow node from 32 to 125 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (RT_NODE125_TYPE *) newnode;
+ for (int i = 0; i < n32->base.n.count; i++)
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
+#else
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+#endif
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used(&n125->base, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_125_update(n125, chunk, value);
+#else
+ node_inner_125_update(n125, chunk, child);
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ RT_NODE256_TYPE *new256;
+
+ /* grow node from 125 to 256 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (RT_NODE256_TYPE *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+#else
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+#endif
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_125_insert(n125, chunk, value);
+#else
+ node_inner_125_insert(n125, chunk, child);
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+#else
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+#endif
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_256_set(n256, chunk, value);
+#else
+ node_inner_256_set(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
--
2.38.1
Attachment: v16-0007-Remove-STRICT-from-bench_search_random_nodes.patch (text/x-patch)
From 42467662e039a9de6a0323d16857e9f17c5e140e Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 12 Dec 2022 10:39:48 +0700
Subject: [PATCH v16 07/12] Remove STRICT from bench_search_random_nodes
---
contrib/bench_radix_tree/bench_radix_tree--1.0.sql | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 83529805fc..2fd689aa91 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -50,7 +50,7 @@ OUT mem_allocated int8,
OUT search_ms int8)
returns record
as 'MODULE_PATHNAME'
-LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
create function bench_fixed_height_search(
fanout int4,
--
2.38.1
Attachment: v16-0011-Template-out-node-search-functions.patch (text/x-patch)
From a9982146efaa2c1b7139bc804ee33c0062be605a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 23 Dec 2022 15:31:49 +0700
Subject: [PATCH v16 11/12] Template out node search functions
---
src/backend/lib/radixtree.c | 168 +-----------------------
src/include/lib/radixtree_search_impl.h | 151 +++++++++++++++++++++
2 files changed, 157 insertions(+), 162 deletions(-)
create mode 100644 src/include/lib/radixtree_search_impl.h
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 79d12b27d2..99450c96c8 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -1109,87 +1109,9 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
static inline bool
rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool found = false;
- rt_node *child = NULL;
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
- int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- child = n4->children[idx];
- else /* RT_ACTION_DELETE */
- chunk_children_array_delete(n4->base.chunks, n4->children,
- n4->base.n.count, idx);
-
- break;
- }
- case RT_NODE_KIND_32:
- {
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
- int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- child = n32->children[idx];
- else /* RT_ACTION_DELETE */
- chunk_children_array_delete(n32->base.chunks, n32->children,
- n32->base.n.count, idx);
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
-
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- child = node_inner_125_get_child(n125, chunk);
- else /* RT_ACTION_DELETE */
- node_inner_125_delete(n125, chunk);
-
- break;
- }
- case RT_NODE_KIND_256:
- {
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
-
- if (!node_inner_256_is_chunk_used(n256, chunk))
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- child = node_inner_256_get_child(n256, chunk);
- else /* RT_ACTION_DELETE */
- node_inner_256_delete(n256, chunk);
-
- break;
- }
- }
-
- /* update statistics */
- if (action == RT_ACTION_DELETE && found)
- node->count--;
-
- if (found && child_p)
- *child_p = child;
-
- return found;
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
}
/*
@@ -1202,87 +1124,9 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **chil
static inline bool
rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool found = false;
- uint64 value = 0;
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
- int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- value = n4->values[idx];
- else /* RT_ACTION_DELETE */
- chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
- n4->base.n.count, idx);
-
- break;
- }
- case RT_NODE_KIND_32:
- {
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
- int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- value = n32->values[idx];
- else /* RT_ACTION_DELETE */
- chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
- n32->base.n.count, idx);
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
-
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- value = node_leaf_125_get_value(n125, chunk);
- else /* RT_ACTION_DELETE */
- node_leaf_125_delete(n125, chunk);
-
- break;
- }
- case RT_NODE_KIND_256:
- {
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
-
- if (!node_leaf_256_is_chunk_used(n256, chunk))
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- value = node_leaf_256_get_value(n256, chunk);
- else /* RT_ACTION_DELETE */
- node_leaf_256_delete(n256, chunk);
-
- break;
- }
- }
-
- /* update statistics */
- if (action == RT_ACTION_DELETE && found)
- node->count--;
-
- if (found && value_p)
- *value_p = value;
-
- return found;
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
}
/* Insert the child to the inner node */
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..0173d9cb2f
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,151 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE rt_node_inner_4
+#define RT_NODE32_TYPE rt_node_inner_32
+#define RT_NODE125_TYPE rt_node_inner_125
+#define RT_NODE256_TYPE rt_node_inner_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE rt_node_leaf_4
+#define RT_NODE32_TYPE rt_node_leaf_32
+#define RT_NODE125_TYPE rt_node_leaf_125
+#define RT_NODE256_TYPE rt_node_leaf_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value = 0;
+#else
+ rt_node *child = NULL;
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[idx];
+#else
+ child = n4->children[idx];
+#endif
+ else /* RT_ACTION_DELETE */
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+#else
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+ else /* RT_ACTION_DELETE */
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+#else
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+#ifdef RT_NODE_LEVEL_LEAF
+ value = node_leaf_125_get_value(n125, chunk);
+#else
+ child = node_inner_125_get_child(n125, chunk);
+#endif
+ else /* RT_ACTION_DELETE */
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_125_delete(n125, chunk);
+#else
+ node_inner_125_delete(n125, chunk);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+#else
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+#endif
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+#ifdef RT_NODE_LEVEL_LEAF
+ value = node_leaf_256_get_value(n256, chunk);
+#else
+ child = node_inner_256_get_child(n256, chunk);
+#endif
+ else /* RT_ACTION_DELETE */
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_256_delete(n256, chunk);
+#else
+ node_inner_256_delete(n256, chunk);
+#endif
+
+ break;
+ }
+ }
+
+ if (found)
+ {
+ /* update statistics */
+ if (action == RT_ACTION_DELETE)
+ node->count--;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (value_p)
+ *value_p = value;
+#else
+ if (child_p)
+ *child_p = child;
+#endif
+ }
+
+ return found;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
--
2.38.1
Attachment: v16-0012-Separate-find-and-delete-actions-into-separate-f.patch (text/x-patch)
From 3637d74416d565e3ca6faff2ad6b6a25b2c50689 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 23 Dec 2022 17:41:05 +0700
Subject: [PATCH v16 12/12] Separate find and delete actions into separate
functions
This makes hot paths smaller and less branchy.
---
src/backend/lib/radixtree.c | 73 ++++++++++++++++---------
src/include/lib/radixtree_search_impl.h | 68 ++++++++++++-----------
2 files changed, 83 insertions(+), 58 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 99450c96c8..c934bff693 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -96,13 +96,6 @@
#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
-/* Enum used rt_node_search() */
-typedef enum
-{
- RT_ACTION_FIND = 0, /* find the key-value */
- RT_ACTION_DELETE, /* delete the key-value */
-} rt_action;
-
/*
* Supported radix tree node kinds and size classes.
*
@@ -422,10 +415,8 @@ static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_cl
bool inner);
static void rt_free_node(radix_tree *tree, rt_node *node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
- uint64 *value_p);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p);
static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
uint64 key, rt_node *child);
static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
@@ -1100,33 +1091,65 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
}
/*
- * Search for the child pointer corresponding to 'key' in the given node, and
- * do the specified 'action'.
+ * Search for the child pointer corresponding to 'key' in the given node.
*
* Return true if the key is found, otherwise return false. On success, the child
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p)
{
+#define RT_ACTION_FIND
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_search_impl.h"
#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_FIND
}
/*
- * Search for the value corresponding to 'key' in the given node, and do the
- * specified 'action'.
+ * Search for the value corresponding to 'key' in the given node.
*
* Return true if the key is found, otherwise return false. On success, the pointer
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p)
+{
+#define RT_ACTION_FIND
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+#undef RT_ACTION_FIND
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the node and return true if the key is found, otherwise return false.
+ */
+static inline bool
+rt_node_delete_inner(rt_node *node, uint64 key)
+{
+#define RT_ACTION_DELETE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_DELETE
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the node and return true if the key is found, otherwise return false.
+ */
+static inline bool
+rt_node_delete_leaf(rt_node *node, uint64 key)
{
+#define RT_ACTION_DELETE
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_search_impl.h"
#undef RT_NODE_LEVEL_LEAF
+#undef RT_ACTION_DELETE
}
/* Insert the child to the inner node */
@@ -1235,7 +1258,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
{
rt_set_extend(tree, key, value, parent, node);
return false;
@@ -1282,14 +1305,14 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
return false;
node = child;
shift -= RT_NODE_SPAN;
}
- return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+ return rt_node_search_leaf(node, key, value_p);
}
/*
@@ -1322,7 +1345,7 @@ rt_delete(radix_tree *tree, uint64 key)
/* Push the current node to the stack */
stack[++level] = node;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
return false;
node = child;
@@ -1331,7 +1354,7 @@ rt_delete(radix_tree *tree, uint64 key)
/* Delete the key from the leaf node if exists */
Assert(NODE_IS_LEAF(node));
- deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+ deleted = rt_node_delete_leaf(node, key);
if (!deleted)
{
@@ -1357,7 +1380,7 @@ rt_delete(radix_tree *tree, uint64 key)
{
node = stack[level--];
- deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ deleted = rt_node_delete_inner(node, key);
Assert(deleted);
/* If the node didn't become empty, we stop deleting the key */
@@ -1989,12 +2012,12 @@ rt_dump_search(radix_tree *tree, uint64 key)
uint64 dummy;
/* We reached at a leaf node, find the corresponding slot */
- rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+ rt_node_search_leaf(node, key, &dummy);
break;
}
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
break;
node = child;
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 0173d9cb2f..28c02da2bf 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -10,16 +10,21 @@
#define RT_NODE256_TYPE rt_node_leaf_256
#else
#error node level must be either inner or leaf
+#endif
+
+#if !defined(RT_ACTION_FIND) && !defined(RT_ACTION_DELETE)
+#error search action must be either find or delete
#endif
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool found = false;
+#if defined(RT_ACTION_FIND)
#ifdef RT_NODE_LEVEL_LEAF
uint64 value = 0;
#else
rt_node *child = NULL;
#endif
+#endif /* RT_ACTION_FIND */
switch (node->kind)
{
@@ -29,17 +34,15 @@
int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
if (idx < 0)
- break;
-
- found = true;
+ return false;
- if (action == RT_ACTION_FIND)
+#if defined(RT_ACTION_FIND)
#ifdef RT_NODE_LEVEL_LEAF
value = n4->values[idx];
#else
child = n4->children[idx];
#endif
- else /* RT_ACTION_DELETE */
+#elif defined (RT_ACTION_DELETE)
#ifdef RT_NODE_LEVEL_LEAF
chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
n4->base.n.count, idx);
@@ -47,6 +50,8 @@
chunk_children_array_delete(n4->base.chunks, n4->children,
n4->base.n.count, idx);
#endif
+#endif /* RT_ACTION_FIND */
+
break;
}
case RT_NODE_KIND_32:
@@ -55,17 +60,15 @@
int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
if (idx < 0)
- break;
-
- found = true;
+ return false;
- if (action == RT_ACTION_FIND)
+#if defined(RT_ACTION_FIND)
#ifdef RT_NODE_LEVEL_LEAF
value = n32->values[idx];
#else
child = n32->children[idx];
#endif
- else /* RT_ACTION_DELETE */
+#elif defined (RT_ACTION_DELETE)
#ifdef RT_NODE_LEVEL_LEAF
chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
n32->base.n.count, idx);
@@ -73,6 +76,8 @@
chunk_children_array_delete(n32->base.chunks, n32->children,
n32->base.n.count, idx);
#endif
+#endif /* RT_ACTION_FIND */
+
break;
}
case RT_NODE_KIND_125:
@@ -80,22 +85,22 @@
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
- break;
+ return false;
- found = true;
-
- if (action == RT_ACTION_FIND)
+#if defined(RT_ACTION_FIND)
#ifdef RT_NODE_LEVEL_LEAF
value = node_leaf_125_get_value(n125, chunk);
#else
child = node_inner_125_get_child(n125, chunk);
#endif
- else /* RT_ACTION_DELETE */
+#elif defined (RT_ACTION_DELETE)
#ifdef RT_NODE_LEVEL_LEAF
node_leaf_125_delete(n125, chunk);
#else
node_inner_125_delete(n125, chunk);
#endif
+#endif /* RT_ACTION_FIND */
+
break;
}
case RT_NODE_KIND_256:
@@ -107,43 +112,40 @@
#else
if (!node_inner_256_is_chunk_used(n256, chunk))
#endif
- break;
-
- found = true;
+ return false;
- if (action == RT_ACTION_FIND)
+#if defined(RT_ACTION_FIND)
#ifdef RT_NODE_LEVEL_LEAF
value = node_leaf_256_get_value(n256, chunk);
#else
child = node_inner_256_get_child(n256, chunk);
#endif
- else /* RT_ACTION_DELETE */
+#elif defined (RT_ACTION_DELETE)
#ifdef RT_NODE_LEVEL_LEAF
node_leaf_256_delete(n256, chunk);
#else
node_inner_256_delete(n256, chunk);
#endif
+#endif /* RT_ACTION_FIND */
break;
}
}
- if (found)
- {
- /* update statistics */
- if (action == RT_ACTION_DELETE)
- node->count--;
-
+#if defined(RT_ACTION_FIND)
#ifdef RT_NODE_LEVEL_LEAF
- if (value_p)
- *value_p = value;
+ Assert(value_p != NULL);
+ *value_p = value;
#else
- if (child_p)
- *child_p = child;
+ Assert(child_p != NULL);
+ *child_p = child;
#endif
- }
+#elif defined (RT_ACTION_DELETE)
+ /* update statistics */
+ node->count--;
+#endif /* RT_ACTION_FIND */
- return found;
+ return true;
#undef RT_NODE4_TYPE
#undef RT_NODE32_TYPE
--
2.38.1
Attachment: v16-0006-Preparatory-refactoring-to-simplify-templating.patch (text/x-patch)
From 4573327ba7fa1179389af4383c04053251a8bf73 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 11 Dec 2022 16:38:08 +0700
Subject: [PATCH v16 06/12] Preparatory refactoring to simplify templating
*Remove gotos and shorten const lookups in node_insert_inner()
*Turn condition into an assert
*Don't cast to base -- use membership
---
src/backend/lib/radixtree.c | 87 ++++++++++++++++++-------------------
1 file changed, 42 insertions(+), 45 deletions(-)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index abd0450727..7c993e096b 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -1294,7 +1294,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
int idx;
- idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ idx = node_4_search_eq(&n4->base, chunk);
if (idx != -1)
{
/* found the existing chunk */
@@ -1321,7 +1321,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
}
else
{
- int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int insertpos = node_4_get_insertpos(&n4->base, chunk);
uint16 count = n4->base.n.count;
/* shift chunks and children */
@@ -1337,10 +1337,12 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
+ const rt_size_class_elem minclass = rt_size_class_info[RT_CLASS_32_PARTIAL];
+ const rt_size_class_elem maxclass = rt_size_class_info[RT_CLASS_32_FULL];
rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
int idx;
- idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ idx = node_32_search_eq(&n32->base, chunk);
if (idx != -1)
{
/* found the existing chunk */
@@ -1349,58 +1351,53 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.count == minclass.fanout)
{
- Assert(parent != NULL);
+ /* use the same node kind, but expand to the next size class */
+ rt_node_inner_32 *new32;
- if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
- {
- /* use the same node kind, but expand to the next size class */
- const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
- const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
- rt_node_inner_32 *new32;
+ new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(new32, n32, minclass.inner_size);
+ new32->base.n.fanout = maxclass.fanout;
- new32 = (rt_node_inner_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, true);
- memcpy(new32, n32, size);
- new32->base.n.fanout = fanout;
-
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
- /* must update both pointers here */
- node = (rt_node *) new32;
- n32 = new32;
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+ }
- goto retry_insert_inner_32;
- }
- else
- {
- rt_node_inner_125 *new125;
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_inner_125 *new125;
- /* grow node from 32 to 125 */
- new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
- for (int i = 0; i < n32->base.n.count; i++)
- node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_inner_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
- node = (rt_node *) new125;
- }
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125, key);
+ node = (rt_node *) new125;
}
else
{
-retry_insert_inner_32:
- {
- int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
- int16 count = n32->base.n.count;
+ int insertpos = node_32_get_insertpos(&n32->base, chunk);
+ int16 count = n32->base.n.count;
- if (count != 0 && insertpos < count)
- chunk_children_array_shift(n32->base.chunks, n32->children,
- count, insertpos);
-
- n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
- break;
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
}
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
}
}
/* FALLTHROUGH */
@@ -1409,7 +1406,7 @@ retry_insert_inner_32:
rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
int cnt = 0;
- if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ if (node_125_is_chunk_used(&n125->base, chunk))
{
/* found the existing chunk */
chunk_exists = true;
@@ -1427,7 +1424,7 @@ retry_insert_inner_32:
RT_NODE_KIND_256);
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ if (!node_125_is_chunk_used(&n125->base, i))
continue;
node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
--
2.38.1
On Fri, Dec 23, 2022 at 8:47 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I wrote:
- Try templating out the differences between local and shared memory.
Here is a brief progress report before Christmas vacation.
Thanks!
I thought the best way to approach this was to go "inside out", that is, start with the modest goal of reducing duplicated code for v16.
0001-0005 are copies from v13.
0006 whacks around the rt_node_insert_inner function to reduce the "surface area" as far as symbols and casts. This includes replacing the goto with an extra "unlikely" branch.
0007 removes the STRICT pragma for one of our benchmark functions that crept in somewhere -- it should use the default and not just return NULL instantly.
0008 further whacks around the node-growing code in rt_node_insert_inner to remove casts. When growing the size class within the same kind, we have no need for a "new32" (etc) variable. Also, to keep from getting confused about what an assert build verifies at the end, add a "newnode" variable and assign it to "node" as soon as possible.
0009 uses the bitmap logic from 0004 for node256 also. There is no performance reason for this, because there is no iteration needed, but it's good for simplicity and consistency.
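For concreteness, with that change the node256 membership test is just the
word-and-bit arithmetic of the BM_IDX/BM_BIT macros. A minimal standalone
sketch, with a hypothetical helper name and a plain 64-bit word standing in
for bitmapword:

    #include <stdbool.h>
    #include <stdint.h>

    /* is the bit for 'chunk' (0..255) set in a 256-bit isset bitmap? */
    static inline bool
    node_256_chunk_is_set(const uint64_t isset[4], uint8_t chunk)
    {
        return (isset[chunk / 64] & ((uint64_t) 1 << (chunk % 64))) != 0;
    }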
These 4 patches make sense to me. We can merge them into 0002 patch
and I'll do similar changes for functions for leaf nodes as well.
0010 and 0011 template a common implementation for both leaf and inner nodes for searching and inserting.
0012: While at it, I couldn't resist using this technique to separate out delete from search, which makes sense and might give a small performance boost (at least on less capable hardware). I haven't got to the iteration functions, but they should be straightforward.
Cool!
There is more that could be done here, but I didn't want to get too far ahead of myself. For example, it's possible that the struct members "children" and "values" are names that don't need to be distinguished. Making them the same would reduce code like

+#ifdef RT_NODE_LEVEL_LEAF
+			n32->values[insertpos] = value;
+#else
+			n32->children[insertpos] = child;
+#endif

...but there could be downsides, and I don't want to distract from the goal of dealing with shared memory.
With these patches, some functions in radixtree.c include the header
files, radixtree_xxx_impl.h, that contain the function bodies. What do you
think about how we can expand this template method to deal with DSA
memory? I imagined that we include, say, radixtree_template.h with some
macros set to use the radix tree, like we do for simplehash.h, and
radixtree_template.h further includes the xxx_impl.h files for some internal
functions.
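For example, the caller side could then look roughly like a simplehash.h
invocation does today. This is only a sketch: the RT_* macro names and
radixtree_template.h itself don't exist yet and are assumptions:

    /* hypothetical template invocation, modeled on simplehash.h */
    #define RT_PREFIX shared_rt
    #define RT_SCOPE static inline
    #define RT_VALUE_TYPE uint64
    #define RT_SHMEM            /* allocate nodes in a DSA instead of a local context */
    #define RT_DECLARE
    #define RT_DEFINE
    #include "lib/radixtree_template.h"

This would emit shared_rt_create(), shared_rt_set(), shared_rt_search() and
so on, and a second invocation without RT_SHMEM would emit the local-memory
variants.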
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Dec 27, 2022 at 12:14 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Dec 23, 2022 at 8:47 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
These 4 patches make sense to me. We can merge them into 0002 patch
Okay, then I'll squash them when I post my next patch.
and I'll do similar changes for functions for leaf nodes as well.
I assume you meant something else? -- some of the differences between inner
and leaf are already abstracted away.
In any case, some things are still half-baked, so please wait until my next
patch before doing work on these files.
Also, CI found a bug on 32-bit -- I know what I missed and will fix next
week.
0010 and 0011 template a common implementation for both leaf and inner
nodes for searching and inserting.
0012: While at it, I couldn't resist using this technique to separate
out delete from search, which makes sense and might give a small
performance boost (at least on less capable hardware). I haven't got to the
iteration functions, but they should be straightforward.
Two things came to mind since I posted this, which I'll make clear next
patch:
- A good compiler will get rid of branches when inlining, so maybe no
difference in code generation, but it still looks nicer this way.
- Delete should really use its own template, because it only _accidentally_
looks like search because we don't yet shrink nodes.
What do you
think about how we can expand this template method to deal with DSA
memory? I imagined that we load say radixtree_template.h with some
macros to use the radix tree like we do for simplehash.h. And
radixtree_template.h further loads xxx_impl.h files for some internal
functions.
Right, I was thinking the same. I wanted to start small and look for
opportunities to shrink the code footprint.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Dec 27, 2022 at 2:24 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Dec 27, 2022 at 12:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Dec 23, 2022 at 8:47 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
These 4 patches make sense to me. We can merge them into 0002 patch
Okay, then I'll squash them when I post my next patch.
and I'll do similar changes for functions for leaf nodes as well.
I assume you meant something else? -- some of the differences between inner and leaf are already abstracted away.
Right. If we template these routines I don't need that.
In any case, some things are still half-baked, so please wait until my next patch before doing work on these files.
Also, CI found a bug on 32-bit -- I know what I missed and will fix next week.
Thanks!
0010 and 0011 template a common implementation for both leaf and inner nodes for searching and inserting.
0012: While at it, I couldn't resist using this technique to separate out delete from search, which makes sense and might give a small performance boost (at least on less capable hardware). I haven't got to the iteration functions, but they should be straightforward.
Two things came to mind since I posted this, which I'll make clear next patch:
- A good compiler will get rid of branches when inlining, so maybe no difference in code generation, but it still looks nicer this way.
- Delete should really use its own template, because it only _accidentally_ looks like search because we don't yet shrink nodes.
Okay.
What do you
think about how we can expand this template method to deal with DSA
memory? I imagined that we load say radixtree_template.h with some
macros to use the radix tree like we do for simplehash.h. And
radixtree_template.h further loads xxx_impl.h files for some internal
functions.
Right, I was thinking the same. I wanted to start small and look for opportunities to shrink the code footprint.
Thank you for your confirmation!
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
[working on templating]
In the end, I decided to base my effort on v8, and not v12 (based on one of
my less-well-thought-out ideas). The latter was a good experiment, but it
did not lead to an increase in readability as I had hoped. The attached v17
is still rough, but it's in good enough shape to evaluate a mostly-complete
templating implementation.
Part of what I didn't like about v8 was distinctions like "node" vs
"nodep", which hinder readability. I've used "allocnode" for some cases
where it makes sense, which is translated to "newnode" for the local
pointer. Some places I just gave up and used "nodep" for parameters like in
v8, just to get it done. We can revisit naming later.
Not done yet:
- get_handle() is not implemented
- rt_attach is defined but unused
- grow_node_kind() was hackishly removed, but could be turned into a macro
(or a function that writes to 2 pointers; see the sketch after this list)
- node_update_inner() is back, now that we can share a template with
"search". Seems easier to read, and I suspect this is easier for the
compiler.
- the value type should really be a template macro, but is still hard-coded
to uint64
- I think it's okay if the key is hard coded for PG16: If some use case
needs more than uint64, we could consider "single-value leaves" with varlen
keys as a template option.
- benchmark tests not updated
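As a sketch of the grow_node_kind() idea mentioned above (the macro name and
arguments are assumptions, not what v17 actually does; rt_grow_node_kind()
stands in for whatever allocate-and-copy routine the real code uses):

    /*
     * Grow 'node' into 'new_kind', writing both the generic pointer and the
     * kind-specific pointer in one step so callers don't need casts.
     */
    #define RT_GROW_NODE_KIND(tree, node, new_kind, newnode_p, typed_p, typed_type) \
        do { \
            *(newnode_p) = rt_grow_node_kind((tree), (node), (new_kind)); \
            *(typed_p) = (typed_type *) *(newnode_p); \
        } while (0)

    /* hypothetical call site inside the insert template:
     * RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32, &newnode, &new32, RT_NODE32_TYPE);
     */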
v13-0007 had some changes to the regression tests, but I haven't included
those. The tests from v13-0003 do pass, both locally and shared. I quickly
hacked together switching between the shared and local tests by hand (which
needs a recompile), but it would be good for maintainability if the tests
could run once each with local and shared memory while using the same
"expected" test output.
Also, I didn't look to see if there were any changes in v14/15 that didn't
have to do with precise memory accounting.
At this point, Masahiko, I'd appreciate your feedback on whether this is an
improvement at all (or at least a good base for improvement), especially
for integrating with the TID store. I think there are some advantages to
the template approach. One possible disadvantage is needing separate
functions for local memory and for shared memory.
If we go this route, I do think the TID store should invoke the template as
static functions. I'm not quite comfortable with a global function that may
not fit well with future use cases.
One review point I'll mention: Somehow I didn't notice there is no use for
the "chunk" field in the rt_node type -- it's only set to zero and copied
when growing. What is the purpose? Removing it would allow the
smallest node to take up only 32 bytes with a fanout of 3, by eliminating
padding.
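To illustrate the arithmetic with a flattened layout (field names, order,
and sizes here are assumptions for the sketch, not the actual radixtree.c
definitions):

    #include <stdint.h>

    typedef struct sketch_node_3
    {
        /* header fields, 5 bytes once 'chunk' is gone */
        uint16_t    count;
        uint8_t     shift;
        uint8_t     kind;
        uint8_t     fanout;
        /* three chunk bytes fill out the first 8-byte word */
        uint8_t     chunks[3];
        /* the pointer array starts on an 8-byte boundary with no padding */
        void       *children[3];
    } sketch_node_3;

    /* 8 + 24 = 32 bytes where pointers are 8 bytes */
    #if UINTPTR_MAX == 0xffffffffffffffff
    _Static_assert(sizeof(sketch_node_3) == 32, "fanout-3 node fits in 32 bytes");
    #endif

With the extra 'chunk' byte in the header, chunks[3] would end at offset 9,
the pointer array would be padded out to offset 16, and the same node would
be 40 bytes.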
Also, v17-0005 has an optimization/simplification for growing into node125
(my version needs an assertion or fallback, but works well now), found by
another reading of Andres' prototype. There is a lot of good engineering
there; we should try to preserve it.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
Attachment: v17-0005-Template-out-inner-and-leaf-nodes.patch (text/x-patch)
From b1de5cbacf06dd975cc2138a498c5d9897e14df7 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 23 Dec 2022 14:33:49 +0700
Subject: [PATCH v17 5/9] Template out inner and leaf nodes
Use a template for each insert, iteration, search, and
delete functions.
To optimize growing into node125, don't search for a
slot each time -- just copy into the first 32
slots and set the slot index at the same time.
Also set all the isset bits with a single store.
Remove node_*_125_update/insert/delete functions and
node_125_find_unused_slot, since they are now unused.
---
src/backend/lib/radixtree.c | 863 ++----------------------
src/include/lib/radixtree_delete_impl.h | 100 +++
src/include/lib/radixtree_insert_impl.h | 293 ++++++++
src/include/lib/radixtree_iter_impl.h | 129 ++++
src/include/lib/radixtree_search_impl.h | 102 +++
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
7 files changed, 694 insertions(+), 805 deletions(-)
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index 5203127f76..80cde09aaf 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -96,13 +96,6 @@
#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
-/* Enum used rt_node_search() */
-typedef enum
-{
- RT_ACTION_FIND = 0, /* find the key-value */
- RT_ACTION_DELETE, /* delete the key-value */
-} rt_action;
-
/*
* Supported radix tree node kinds and size classes.
*
@@ -422,10 +415,8 @@ static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_cl
bool inner);
static void rt_free_node(radix_tree *tree, rt_node *node);
static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
- rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
- uint64 *value_p);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p);
static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
uint64 key, rt_node *child);
static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
@@ -663,102 +654,6 @@ node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
return node->values[node->base.slot_idxs[chunk]];
}
-static void
-node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
-{
- int slotpos = node->base.slot_idxs[chunk];
- int idx = BM_IDX(slotpos);
- int bitnum = BM_BIT(slotpos);
-
- Assert(!NODE_IS_LEAF(node));
-
- node->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
- node->children[node->base.slot_idxs[chunk]] = NULL;
- node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
-}
-
-static void
-node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
-{
- int slotpos = node->base.slot_idxs[chunk];
- int idx = BM_IDX(slotpos);
- int bitnum = BM_BIT(slotpos);
-
- Assert(NODE_IS_LEAF(node));
- node->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
- node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
-}
-
-/* Return an unused slot in node-125 */
-static int
-node_125_find_unused_slot(bitmapword *isset)
-{
- int slotpos;
- int idx;
- bitmapword inverse;
-
- /* get the first word with at least one bit not set */
- for (idx = 0; idx < BM_IDX(128); idx++)
- {
- if (isset[idx] < ~((bitmapword) 0))
- break;
- }
-
- /* To get the first unset bit in X, get the first set bit in ~X */
- inverse = ~(isset[idx]);
- slotpos = idx * BITS_PER_BITMAPWORD;
- slotpos += bmw_rightmost_one_pos(inverse);
-
- /* mark the slot used */
- isset[idx] |= bmw_rightmost_one(inverse);
-
- return slotpos;
- }
-
-static inline void
-node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
-{
- int slotpos;
-
- Assert(!NODE_IS_LEAF(node));
-
- slotpos = node_125_find_unused_slot(node->base.isset);
- Assert(slotpos < node->base.n.fanout);
-
- node->base.slot_idxs[chunk] = slotpos;
- node->children[slotpos] = child;
-}
-
-/* Set the slot at the corresponding chunk */
-static inline void
-node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
-{
- int slotpos;
-
- Assert(NODE_IS_LEAF(node));
-
- slotpos = node_125_find_unused_slot(node->base.isset);
- Assert(slotpos < node->base.n.fanout);
-
- node->base.slot_idxs[chunk] = slotpos;
- node->values[slotpos] = value;
-}
-
-/* Update the child corresponding to 'chunk' to 'child' */
-static inline void
-node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
-{
- Assert(!NODE_IS_LEAF(node));
- node->children[node->base.slot_idxs[chunk]] = child;
-}
-
-static inline void
-node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
-{
- Assert(NODE_IS_LEAF(node));
- node->values[node->base.slot_idxs[chunk]] = value;
-}
-
/* Functions to manipulate inner and leaf node-256 */
/* Return true if the slot corresponding to the given chunk is in use */
@@ -1075,189 +970,57 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
}
/*
- * Search for the child pointer corresponding to 'key' in the given node, and
- * do the specified 'action'.
+ * Search for the child pointer corresponding to 'key' in the given node.
*
* Return true if the key is found, otherwise return false. On success, the child
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool found = false;
- rt_node *child = NULL;
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
- int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- child = n4->children[idx];
- else /* RT_ACTION_DELETE */
- chunk_children_array_delete(n4->base.chunks, n4->children,
- n4->base.n.count, idx);
-
- break;
- }
- case RT_NODE_KIND_32:
- {
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
- int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- child = n32->children[idx];
- else /* RT_ACTION_DELETE */
- chunk_children_array_delete(n32->base.chunks, n32->children,
- n32->base.n.count, idx);
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
-
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- child = node_inner_125_get_child(n125, chunk);
- else /* RT_ACTION_DELETE */
- node_inner_125_delete(n125, chunk);
-
- break;
- }
- case RT_NODE_KIND_256:
- {
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
-
- if (!node_inner_256_is_chunk_used(n256, chunk))
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- child = node_inner_256_get_child(n256, chunk);
- else /* RT_ACTION_DELETE */
- node_inner_256_delete(n256, chunk);
-
- break;
- }
- }
-
- /* update statistics */
- if (action == RT_ACTION_DELETE && found)
- node->count--;
-
- if (found && child_p)
- *child_p = child;
-
- return found;
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
}
/*
- * Search for the value corresponding to 'key' in the given node, and do the
- * specified 'action'.
+ * Search for the value corresponding to 'key' in the given node.
*
* Return true if the key is found, otherwise return false. On success, the pointer
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool found = false;
- uint64 value = 0;
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
- int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- value = n4->values[idx];
- else /* RT_ACTION_DELETE */
- chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
- n4->base.n.count, idx);
-
- break;
- }
- case RT_NODE_KIND_32:
- {
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
- int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
-
- if (idx < 0)
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- value = n32->values[idx];
- else /* RT_ACTION_DELETE */
- chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
- n32->base.n.count, idx);
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
-
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
- break;
-
- found = true;
-
- if (action == RT_ACTION_FIND)
- value = node_leaf_125_get_value(n125, chunk);
- else /* RT_ACTION_DELETE */
- node_leaf_125_delete(n125, chunk);
-
- break;
- }
- case RT_NODE_KIND_256:
- {
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
-
- if (!node_leaf_256_is_chunk_used(n256, chunk))
- break;
-
- found = true;
- if (action == RT_ACTION_FIND)
- value = node_leaf_256_get_value(n256, chunk);
- else /* RT_ACTION_DELETE */
- node_leaf_256_delete(n256, chunk);
-
- break;
- }
- }
-
- /* update statistics */
- if (action == RT_ACTION_DELETE && found)
- node->count--;
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
- if (found && value_p)
- *value_p = value;
+/*
+ * Search for the child pointer corresponding to 'key' in the given inner node.
+ *
+ * Delete the child and return true if the key is found, otherwise return false.
+ */
+static inline bool
+rt_node_delete_inner(rt_node *node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
- return found;
+/*
+ * Search for the value corresponding to 'key' in the given leaf node.
+ *
+ * Delete the value and return true if the key is found, otherwise return false.
+ */
+static inline bool
+rt_node_delete_leaf(rt_node *node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
}
/* Insert the child to the inner node */
@@ -1265,185 +1028,9 @@ static bool
rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
rt_node *child)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool chunk_exists = false;
- rt_node *newnode = NULL;
-
- Assert(!NODE_IS_LEAF(node));
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
- int idx;
-
- idx = node_4_search_eq(&n4->base, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n4->children[idx] = child;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
- {
- rt_node_inner_32 *new32;
-
- /* grow node from 4 to 32 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
- new32 = (rt_node_inner_32 *) newnode;
- chunk_children_array_copy(n4->base.chunks, n4->children,
- new32->base.chunks, new32->children);
-
- Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
- }
- else
- {
- int insertpos = node_4_get_insertpos(&n4->base, chunk);
- uint16 count = n4->base.n.count;
-
- /* shift chunks and children */
- if (count != 0 && insertpos < count)
- chunk_children_array_shift(n4->base.chunks, n4->children,
- count, insertpos);
-
- n4->base.chunks[insertpos] = chunk;
- n4->children[insertpos] = child;
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_32:
- {
- const rt_size_class_elem minclass = rt_size_class_info[RT_CLASS_32_PARTIAL];
- const rt_size_class_elem maxclass = rt_size_class_info[RT_CLASS_32_FULL];
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
- int idx;
-
- idx = node_32_search_eq(&n32->base, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n32->children[idx] = child;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
- n32->base.n.count == minclass.fanout)
- {
- /* grow to the next size class of this kind */
- newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
- memcpy(newnode, node, minclass.inner_size);
- newnode->fanout = maxclass.fanout;
-
- Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
-
- /* also update pointer for this kind */
- n32 = (rt_node_inner_32 *) newnode;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
- {
- rt_node_inner_125 *new125;
-
- /* grow node from 32 to 125 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
- new125 = (rt_node_inner_125 *) newnode;
- for (int i = 0; i < n32->base.n.count; i++)
- node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
-
- Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
- }
- else
- {
- int insertpos = node_32_get_insertpos(&n32->base, chunk);
- int16 count = n32->base.n.count;
-
- if (insertpos < count)
- {
- Assert(count > 0);
- chunk_children_array_shift(n32->base.chunks, n32->children,
- count, insertpos);
- }
-
- n32->base.chunks[insertpos] = chunk;
- n32->children[insertpos] = child;
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_125:
- {
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
- int cnt = 0;
-
- if (node_125_is_chunk_used(&n125->base, chunk))
- {
- /* found the existing chunk */
- chunk_exists = true;
- node_inner_125_update(n125, chunk, child);
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
- {
- rt_node_inner_256 *new256;
- Assert(parent != NULL);
-
- /* grow node from 125 to 256 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
- new256 = (rt_node_inner_256 *) newnode;
- for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
- {
- if (!node_125_is_chunk_used(&n125->base, i))
- continue;
-
- node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
- cnt++;
- }
-
- rt_replace_node(tree, parent, node, newnode, key);
- node = newnode;
- }
- else
- {
- node_inner_125_insert(n125, chunk, child);
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_256:
- {
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
-
- chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
-
- node_inner_256_set(n256, chunk, child);
- break;
- }
- }
-
- /* Update statistics */
- if (!chunk_exists)
- node->count++;
-
- /*
- * Done. Finally, verify the chunk and value is inserted or replaced
- * properly in the node.
- */
- rt_verify_node(node);
-
- return chunk_exists;
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
}
/* Insert the value to the leaf node */
@@ -1451,190 +1038,9 @@ static bool
rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
uint64 key, uint64 value)
{
- uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool chunk_exists = false;
-
- Assert(NODE_IS_LEAF(node));
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
- int idx;
-
- idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n4->values[idx] = value;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
- {
- rt_node_leaf_32 *new32;
- Assert(parent != NULL);
-
- /* grow node from 4 to 32 */
- new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
- RT_NODE_KIND_32);
- chunk_values_array_copy(n4->base.chunks, n4->values,
- new32->base.chunks, new32->values);
- rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
- node = (rt_node *) new32;
- }
- else
- {
- int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
- int count = n4->base.n.count;
-
- /* shift chunks and values */
- if (count != 0 && insertpos < count)
- chunk_values_array_shift(n4->base.chunks, n4->values,
- count, insertpos);
-
- n4->base.chunks[insertpos] = chunk;
- n4->values[insertpos] = value;
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_32:
- {
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
- int idx;
-
- idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
- if (idx != -1)
- {
- /* found the existing chunk */
- chunk_exists = true;
- n32->values[idx] = value;
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
- {
- Assert(parent != NULL);
-
- if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
- {
- /* use the same node kind, but expand to the next size class */
- const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
- const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
- rt_node_leaf_32 *new32;
-
- new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
- memcpy(new32, n32, size);
- new32->base.n.fanout = fanout;
-
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
-
- /* must update both pointers here */
- node = (rt_node *) new32;
- n32 = new32;
-
- goto retry_insert_leaf_32;
- }
- else
- {
- rt_node_leaf_125 *new125;
-
- /* grow node from 32 to 125 */
- new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
- RT_NODE_KIND_125);
- for (int i = 0; i < n32->base.n.count; i++)
- node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
-
- rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
- key);
- node = (rt_node *) new125;
- }
- }
- else
- {
- retry_insert_leaf_32:
- {
- int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
- int count = n32->base.n.count;
-
- if (count != 0 && insertpos < count)
- chunk_values_array_shift(n32->base.chunks, n32->values,
- count, insertpos);
-
- n32->base.chunks[insertpos] = chunk;
- n32->values[insertpos] = value;
- break;
- }
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_125:
- {
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
- int cnt = 0;
-
- if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
- {
- /* found the existing chunk */
- chunk_exists = true;
- node_leaf_125_update(n125, chunk, value);
- break;
- }
-
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
- {
- rt_node_leaf_256 *new256;
- Assert(parent != NULL);
-
- /* grow node from 125 to 256 */
- new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
- RT_NODE_KIND_256);
- for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
- {
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
- continue;
-
- node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
- cnt++;
- }
-
- rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
- key);
- node = (rt_node *) new256;
- }
- else
- {
- node_leaf_125_insert(n125, chunk, value);
- break;
- }
- }
- /* FALLTHROUGH */
- case RT_NODE_KIND_256:
- {
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
-
- chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
- Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
-
- node_leaf_256_set(n256, chunk, value);
- break;
- }
- }
-
- /* Update statistics */
- if (!chunk_exists)
- node->count++;
-
- /*
- * Done. Finally, verify the chunk and value is inserted or replaced
- * properly in the node.
- */
- rt_verify_node(node);
-
- return chunk_exists;
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
}
/*
@@ -1723,7 +1129,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
{
rt_set_extend(tree, key, value, parent, node);
return false;
@@ -1770,14 +1176,14 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
return false;
node = child;
shift -= RT_NODE_SPAN;
}
- return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+ return rt_node_search_leaf(node, key, value_p);
}
/*
@@ -1810,7 +1216,7 @@ rt_delete(radix_tree *tree, uint64 key)
/* Push the current node to the stack */
stack[++level] = node;
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
return false;
node = child;
@@ -1819,7 +1225,7 @@ rt_delete(radix_tree *tree, uint64 key)
/* Delete the key from the leaf node if exists */
Assert(NODE_IS_LEAF(node));
- deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+ deleted = rt_node_delete_leaf(node, key);
if (!deleted)
{
@@ -1845,7 +1251,7 @@ rt_delete(radix_tree *tree, uint64 key)
{
node = stack[level--];
- deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ deleted = rt_node_delete_inner(node, key);
Assert(deleted);
/* If the node didn't become empty, we stop deleting the key */
@@ -1994,84 +1400,9 @@ rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
static inline rt_node *
rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
{
- rt_node *child = NULL;
- bool found = false;
- uint8 key_chunk;
-
- switch (node_iter->node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
-
- node_iter->current_idx++;
- if (node_iter->current_idx >= n4->base.n.count)
- break;
-
- child = n4->children[node_iter->current_idx];
- key_chunk = n4->base.chunks[node_iter->current_idx];
- found = true;
- break;
- }
- case RT_NODE_KIND_32:
- {
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
-
- node_iter->current_idx++;
- if (node_iter->current_idx >= n32->base.n.count)
- break;
-
- child = n32->children[node_iter->current_idx];
- key_chunk = n32->base.chunks[node_iter->current_idx];
- found = true;
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
- int i;
-
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
- break;
- }
-
- if (i >= RT_NODE_MAX_SLOTS)
- break;
-
- node_iter->current_idx = i;
- child = node_inner_125_get_child(n125, i);
- key_chunk = i;
- found = true;
- break;
- }
- case RT_NODE_KIND_256:
- {
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
- int i;
-
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (node_inner_256_is_chunk_used(n256, i))
- break;
- }
-
- if (i >= RT_NODE_MAX_SLOTS)
- break;
-
- node_iter->current_idx = i;
- child = node_inner_256_get_child(n256, i);
- key_chunk = i;
- found = true;
- break;
- }
- }
-
- if (found)
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
-
- return child;
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
}
/*
@@ -2082,88 +1413,9 @@ static inline bool
rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
uint64 *value_p)
{
- rt_node *node = node_iter->node;
- bool found = false;
- uint64 value;
- uint8 key_chunk;
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
-
- node_iter->current_idx++;
- if (node_iter->current_idx >= n4->base.n.count)
- break;
-
- value = n4->values[node_iter->current_idx];
- key_chunk = n4->base.chunks[node_iter->current_idx];
- found = true;
- break;
- }
- case RT_NODE_KIND_32:
- {
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
-
- node_iter->current_idx++;
- if (node_iter->current_idx >= n32->base.n.count)
- break;
-
- value = n32->values[node_iter->current_idx];
- key_chunk = n32->base.chunks[node_iter->current_idx];
- found = true;
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
- int i;
-
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
- break;
- }
-
- if (i >= RT_NODE_MAX_SLOTS)
- break;
-
- node_iter->current_idx = i;
- value = node_leaf_125_get_value(n125, i);
- key_chunk = i;
- found = true;
- break;
- }
- case RT_NODE_KIND_256:
- {
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
- int i;
-
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (node_leaf_256_is_chunk_used(n256, i))
- break;
- }
-
- if (i >= RT_NODE_MAX_SLOTS)
- break;
-
- node_iter->current_idx = i;
- value = node_leaf_256_get_value(n256, i);
- key_chunk = i;
- found = true;
- break;
- }
- }
-
- if (found)
- {
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
- *value_p = value;
- }
-
- return found;
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
}
/*
@@ -2229,6 +1481,7 @@ rt_verify_node(rt_node *node)
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
int bitnum = BM_BIT(slot);
if (!node_125_is_chunk_used(n125, i))
@@ -2236,7 +1489,7 @@ rt_verify_node(rt_node *node)
/* Check if the corresponding slot is used */
Assert(slot < node->fanout);
- Assert((n125->isset[i] & ((bitmapword) 1 << bitnum)) != 0);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
cnt++;
}
@@ -2476,12 +1729,12 @@ rt_dump_search(radix_tree *tree, uint64 key)
uint64 dummy;
/* We reached at a leaf node, find the corresponding slot */
- rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+ rt_node_search_leaf(node, key, &dummy);
break;
}
- if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ if (!rt_node_search_inner(node, key, &child))
break;
node = child;
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..24fd9cc02b
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,100 @@
+/* TODO: shrink nodes */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE rt_node_inner_4
+#define RT_NODE32_TYPE rt_node_inner_32
+#define RT_NODE125_TYPE rt_node_inner_125
+#define RT_NODE256_TYPE rt_node_inner_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE rt_node_leaf_4
+#define RT_NODE32_TYPE rt_node_leaf_32
+#define RT_NODE125_TYPE rt_node_leaf_125
+#define RT_NODE256_TYPE rt_node_leaf_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+#else
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+#else
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+#else
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_256_delete(n256, chunk);
+#else
+ node_inner_256_delete(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..c63fe9a3c0
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,293 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE rt_node_inner_4
+#define RT_NODE32_TYPE rt_node_inner_32
+#define RT_NODE125_TYPE rt_node_inner_125
+#define RT_NODE256_TYPE rt_node_inner_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE rt_node_leaf_4
+#define RT_NODE32_TYPE rt_node_leaf_32
+#define RT_NODE125_TYPE rt_node_leaf_125
+#define RT_NODE256_TYPE rt_node_leaf_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ rt_node *newnode = NULL;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx;
+
+ idx = node_4_search_eq(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[idx] = value;
+#else
+ n4->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ RT_NODE32_TYPE *new32;
+
+ /* grow node from 4 to 32 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (RT_NODE32_TYPE *) newnode;
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+#else
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+#endif
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos(&n4->base, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+#else
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+#endif
+ }
+
+ n4->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[insertpos] = value;
+#else
+ n4->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const rt_size_class_elem minclass = rt_size_class_info[RT_CLASS_32_PARTIAL];
+ const rt_size_class_elem maxclass = rt_size_class_info[RT_CLASS_32_FULL];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = node_32_search_eq(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.fanout == minclass.fanout)
+ {
+ /* grow to the next size class of this kind */
+#ifdef RT_NODE_LEVEL_LEAF
+ newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(newnode, node, minclass.leaf_size);
+#else
+ newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(newnode, node, minclass.inner_size);
+#endif
+ newnode->fanout = maxclass.fanout;
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (RT_NODE32_TYPE *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ RT_NODE125_TYPE *new125;
+
+ Assert(n32->base.n.fanout == maxclass.fanout);
+
+ /* grow node from 32 to 125 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < maxclass.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ Assert(maxclass.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << maxclass.fanout) - 1);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_NODE_125_INVALID_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ RT_NODE256_TYPE *new256;
+
+ /* grow node from 125 to 256 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (RT_NODE256_TYPE *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+#else
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+#endif
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(128); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+#else
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+#endif
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+#ifdef RT_NODE_LEVEL_LEAF
+ node_leaf_256_set(n256, chunk, value);
+#else
+ node_inner_256_set(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..bebf8e725a
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,129 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE rt_node_inner_4
+#define RT_NODE32_TYPE rt_node_inner_32
+#define RT_NODE125_TYPE rt_node_inner_125
+#define RT_NODE256_TYPE rt_node_inner_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE rt_node_leaf_4
+#define RT_NODE32_TYPE rt_node_leaf_32
+#define RT_NODE125_TYPE rt_node_leaf_125
+#define RT_NODE256_TYPE rt_node_leaf_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value;
+#else
+ rt_node *child = NULL;
+#endif
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[node_iter->current_idx];
+#else
+ child = n4->children[node_iter->current_idx];
+#endif
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = n32->children[node_iter->current_idx];
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = node_leaf_125_get_value(n125, i);
+#else
+ child = node_inner_125_get_child(n125, i);
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (node_leaf_256_is_chunk_used(n256, i))
+#else
+ if (node_inner_256_is_chunk_used(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = node_leaf_256_get_value(n256, i);
+#else
+ child = node_inner_256_get_child(n256, i);
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..d0366f9bb6
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,102 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE rt_node_inner_4
+#define RT_NODE32_TYPE rt_node_inner_32
+#define RT_NODE125_TYPE rt_node_inner_125
+#define RT_NODE256_TYPE rt_node_inner_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE rt_node_leaf_4
+#define RT_NODE32_TYPE rt_node_leaf_32
+#define RT_NODE125_TYPE rt_node_leaf_125
+#define RT_NODE256_TYPE rt_node_leaf_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value = 0;
+#else
+ rt_node *child = NULL;
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[idx];
+#else
+ child = n4->children[idx];
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = node_leaf_125_get_value(n125, chunk);
+#else
+ child = node_inner_125_get_child(n125, chunk);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+#else
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = node_leaf_256_get_value(n256, chunk);
+#else
+ child = node_inner_256_get_child(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ *value_p = value;
+#else
+ Assert(child_p != NULL);
+ *child_p = child;
+#endif
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
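(A reading aid, not part of the patch: the three radixtree_*_impl.h fragments above all use the same trick of being #include'd from inside a function body, with RT_NODE_LEVEL_INNER/RT_NODE_LEVEL_LEAF selecting the per-level types while the fragment refers to the enclosing function's parameters such as node, key, child and value. A minimal, self-contained sketch of the pattern with purely hypothetical names:

/* toy_impl.h -- a code fragment, never included standalone (hypothetical) */
#if defined(TOY_LEVEL_INNER)
#define TOY_SLOT_TYPE void *
#elif defined(TOY_LEVEL_LEAF)
#define TOY_SLOT_TYPE unsigned long
#else
#error level must be either inner or leaf
#endif

	/* 'slots' and 'idx' come from the enclosing function, as in the patch */
	TOY_SLOT_TYPE slot = slots[idx];

	/* level-specific work would go here; both levels share this skeleton */
	return slot != 0;

#undef TOY_SLOT_TYPE

/* toy.c -- each wrapper includes the fragment once, with the level selected */
#include <stdbool.h>

static bool
toy_used_inner(void **slots, int idx)
{
#define TOY_LEVEL_INNER
#include "toy_impl.h"
#undef TOY_LEVEL_INNER
}

static bool
toy_used_leaf(unsigned long *slots, int idx)
{
#define TOY_LEVEL_LEAF
#include "toy_impl.h"
#undef TOY_LEVEL_LEAF
}

The compiler ends up seeing two ordinary functions per operation, so there is no runtime dispatch on the node level; that is the point of splitting the bodies out into these headers.)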
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.0
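(Not part of the patches: a rough usage sketch of the shared-memory API added by the next patch. It assumes the template is instantiated so that RT_CREATE/RT_ATTACH/RT_SET/RT_SEARCH come out as rt_create()/rt_attach()/rt_set()/rt_search(); my_tranche_id, handle and tree_dp are illustrative placeholders, and the handle-passing step (RT_GET_HANDLE) is only indicated:

/* creating backend */
dsa_area      *area = dsa_create(my_tranche_id);
rt_radix_tree *tree = rt_create(CurrentMemoryContext, area);
uint64         key = 1;

rt_set(tree, key, 10);

/* attaching backend: receives the DSA handle and the tree's dsa_pointer out of band */
dsa_area      *area2 = dsa_attach(handle);
rt_radix_tree *tree2 = rt_attach(area2, tree_dp);	/* tree_dp from RT_GET_HANDLE */
uint64         value;

if (rt_search(tree2, key, &value))
	Assert(value == 10);

As the XXX comments in the patch itself note, concurrent iteration and locking are still open items.)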
Attachment: v17-0009-Implement-shared-memory.patch (text/x-patch; charset=US-ASCII)
From c78f27a61d649b0981fc150c3894a0e1a992bcc0 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 9 Jan 2023 14:32:39 +0700
Subject: [PATCH v17 9/9] Implement shared memory
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 376 +++++++++++++-----
src/include/lib/radixtree_delete_impl.h | 6 +
src/include/lib/radixtree_insert_impl.h | 43 +-
src/include/lib/radixtree_iter_impl.h | 19 +-
src/include/lib/radixtree_search_impl.h | 28 +-
src/include/utils/dsa.h | 1 +
.../modules/test_radixtree/test_radixtree.c | 43 ++
8 files changed, 402 insertions(+), 126 deletions(-)
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 604b702a91..50f0aae3ab 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
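(Aside, not part of the patch text: the new dsa_get_total_size() above just reads total_segment_size under the area lock, so the shared case of RT_MEMORY_USAGE below reduces to a one-liner, roughly:

#ifdef RT_SHMEM
	/* shared case: size of all DSM segments backing the DSA area */
	total = dsa_get_total_size(tree->dsa);
#endif

Since this counts whole segments, it can somewhat overstate the memory actually occupied by tree nodes.)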
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index b3d84da033..2b58a0cdf5 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -42,6 +42,8 @@
* - RT_DEFINE - if defined function definitions are generated
* - RT_SCOPE - in which scope (e.g. extern, static inline) do function
* declarations reside
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
*
* Optional parameters:
* - RT_DEBUG - if defined add stats tracking and debugging functions
@@ -51,6 +53,9 @@
*
* RT_CREATE - Create a new, empty radix tree
* RT_FREE - Free the radix tree
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
* RT_SEARCH - Search a key-value pair
* RT_SET - Set a key-value pair
* RT_DELETE - Delete a key-value pair
@@ -80,7 +85,8 @@
#include "miscadmin.h"
#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
-#include "port/pg_lfind.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
/* helpers */
@@ -92,6 +98,9 @@
#define RT_CREATE RT_MAKE_NAME(create)
#define RT_FREE RT_MAKE_NAME(free)
#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#endif
#define RT_SET RT_MAKE_NAME(set)
#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
@@ -110,9 +119,11 @@
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
-#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
@@ -138,6 +149,7 @@
#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
@@ -150,6 +162,7 @@
/* type declarations */
#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
#define RT_ITER RT_MAKE_NAME(iter)
#define RT_NODE RT_MAKE_NAME(node)
#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
@@ -181,8 +194,14 @@
typedef struct RT_RADIX_TREE RT_RADIX_TREE;
typedef struct RT_ITER RT_ITER;
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+#else
RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
@@ -301,9 +320,21 @@ typedef struct RT_NODE
uint8 kind;
} RT_NODE;
+
#define RT_PTR_LOCAL RT_NODE *
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
@@ -512,21 +543,33 @@ static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
};
/* A radix tree with nodes */
-typedef struct RT_RADIX_TREE
+typedef struct RT_RADIX_TREE_CONTROL
{
- MemoryContext context;
-
RT_PTR_ALLOC root;
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
- MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
-
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* A radix tree with nodes */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+ dsa_pointer ctl_dp;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
} RT_RADIX_TREE;
/*
@@ -542,6 +585,11 @@ typedef struct RT_RADIX_TREE
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: Currently we allow only one process to iterate over the tree. Therefore,
+ * rt_node_iter holds local pointers to nodes rather than RT_PTR_ALLOC.
+ * We need either a safeguard that disallows other processes from beginning an
+ * iteration while one is in progress, or support for concurrent iterations.
*/
typedef struct RT_NODE_ITER
{
@@ -562,14 +610,35 @@ typedef struct RT_ITER
} RT_ITER;
-static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
- uint64 key, RT_PTR_LOCAL child);
-static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
uint64 key, uint64 value);
/* verification (available only with assertion) */
static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
* if there is no such element.
@@ -801,7 +870,7 @@ static inline bool
RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
}
static inline bool
@@ -855,7 +924,7 @@ static inline void
RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
}
static inline void
@@ -897,21 +966,31 @@ RT_SHIFT_GET_MAX_VAL(int shift)
static RT_PTR_ALLOC
RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
{
- RT_PTR_ALLOC newnode;
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
if (inner)
- newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
- RT_SIZE_CLASS_INFO[size_class].inner_size);
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
else
- newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
- RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (inner)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+#endif
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[size_class]++;
+ tree->ctl->cnt[size_class]++;
#endif
- return newnode;
+ return allocnode;
}
/* Initialize the node contents */
@@ -951,13 +1030,15 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
{
int shift = RT_KEY_GET_SHIFT(key);
bool inner = shift > 0;
- RT_PTR_ALLOC newnode;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
- newnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newnode->shift = shift;
- tree->max_val = RT_SHIFT_GET_MAX_VAL(shift);
- tree->root = newnode;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
}
static inline void
@@ -967,7 +1048,7 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
newnode->chunk = oldnode->chunk;
newnode->count = oldnode->count;
}
-
+#if 0
/*
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
@@ -975,30 +1056,33 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
static RT_NODE*
RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
{
- RT_PTR_ALLOC newnode;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
bool inner = !NODE_IS_LEAF(node);
- newnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
RT_COPY_NODE(newnode, node);
return newnode;
}
-
+#endif
/* Free the given node */
static void
-RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
+ if (tree->ctl->root == allocnode)
{
- tree->root = NULL;
- tree->max_val = 0;
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
}
#ifdef RT_DEBUG
{
int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
/* update the statistics */
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
@@ -1011,12 +1095,26 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
if (i == RT_SIZE_CLASS_COUNT)
i = RT_CLASS_256;
- tree->cnt[i]--;
- Assert(tree->cnt[i] >= 0);
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
}
#endif
- pfree(node);
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+static inline bool
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
}
/*
@@ -1026,19 +1124,25 @@ static void
RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
RT_PTR_ALLOC new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
+ RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
+
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old->chunk == new->chunk);
+ Assert(old->shift == new->shift);
+#endif
- if (parent == old_child)
+ if (parent == old)
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->ctl->root = new_child;
}
else
{
- bool replaced PG_USED_FOR_ASSERTS_ONLY;
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = RT_NODE_INSERT_INNER(tree, NULL, parent, key, new_child);
+ replaced = RT_NODE_UPDATE_INNER(parent, key, new_child);
Assert(replaced);
}
@@ -1053,7 +1157,8 @@ static void
RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = RT_KEY_GET_SHIFT(key);
@@ -1065,22 +1170,23 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
RT_NODE_INNER_4 *n4;
allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
- node = (RT_PTR_LOCAL) allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
node->shift = shift;
node->count = 1;
n4 = (RT_NODE_INNER_4 *) node;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
- tree->root->chunk = 0;
- tree->root = node;
+ /* Update the root */
+ tree->ctl->root = allocnode;
+ root->chunk = 0;
shift += RT_NODE_SPAN;
}
- tree->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
}
/*
@@ -1089,10 +1195,12 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
*/
static inline void
RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
- RT_PTR_LOCAL node)
+ RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
{
int shift = node->shift;
+ Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+
while (shift >= RT_NODE_SPAN)
{
RT_PTR_ALLOC allocchild;
@@ -1101,19 +1209,20 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent
bool inner = newshift > 0;
allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
- newchild = (RT_PTR_LOCAL) allocchild;
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newchild->shift = newshift;
newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
- RT_NODE_INSERT_INNER(tree, parent, node, key, newchild);
+ RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
parent = node;
node = newchild;
+ nodep = allocchild;
shift -= RT_NODE_SPAN;
}
- RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
- tree->num_keys++;
+ RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ tree->ctl->num_keys++;
}
/*
@@ -1172,8 +1281,8 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
/* Insert the child to the inner node */
static bool
-RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node, uint64 key,
- RT_PTR_ALLOC child)
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_insert_impl.h"
@@ -1182,7 +1291,7 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node
/* Insert the value to the leaf node */
static bool
-RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
uint64 key, uint64 value)
{
#define RT_NODE_LEVEL_LEAF
@@ -1194,18 +1303,26 @@ RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
* Create the radix tree in the given memory context and return it.
*/
RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
RT_CREATE(MemoryContext ctx)
+#endif
{
RT_RADIX_TREE *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(RT_RADIX_TREE));
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
tree->context = ctx;
- tree->root = NULL;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ tree->ctl_dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, tree->ctl_dp);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
/* Create the slab allocator for each size class */
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
@@ -1218,27 +1335,52 @@ RT_CREATE(MemoryContext ctx)
RT_SIZE_CLASS_INFO[i].name,
RT_SIZE_CLASS_INFO[i].leaf_blocksize,
RT_SIZE_CLASS_INFO[i].leaf_size);
-#ifdef RT_DEBUG
- tree->cnt[i] = 0;
-#endif
}
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
MemoryContextSwitchTo(old_ctx);
return tree;
}
+#ifdef RT_SHMEM
+RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, dsa_pointer dp)
+{
+ RT_RADIX_TREE *tree;
+
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ tree->ctl_dp = dp;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+
+ /* XXX: do we need to set a callback on exit to detach dsa? */
+
+ return tree;
+}
+#endif
+
/*
* Free the given radix tree.
*/
RT_SCOPE void
RT_FREE(RT_RADIX_TREE *tree)
{
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, tree->ctl_dp); // XXX
+ dsa_detach(tree->dsa);
+#else
+ pfree(tree->ctl);
+
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
MemoryContextDelete(tree->inner_slabs[i]);
MemoryContextDelete(tree->leaf_slabs[i]);
}
+#endif
pfree(tree);
}
@@ -1252,46 +1394,50 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- RT_PTR_LOCAL node;
RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC nodep;
+ RT_PTR_LOCAL node;
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
RT_NEW_ROOT(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
RT_EXTEND(tree, key);
- Assert(tree->root);
+ //Assert(tree->ctl->root);
- shift = tree->root->shift;
- node = parent = tree->root;
+ nodep = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, nodep);
+ shift = parent->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- RT_PTR_LOCAL child;
+ RT_PTR_ALLOC child;
+
+ node = RT_PTR_GET_LOCAL(tree, nodep);
if (NODE_IS_LEAF(node))
break;
if (!RT_NODE_SEARCH_INNER(node, key, &child))
{
- RT_SET_EXTEND(tree, key, value, parent, node);
+ RT_SET_EXTEND(tree, key, value, parent, nodep, node);
return false;
}
parent = node;
- node = child;
+ nodep = child;
shift -= RT_NODE_SPAN;
}
- updated = RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
+ updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ tree->ctl->num_keys++;
return updated;
}
@@ -1309,11 +1455,11 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
@@ -1326,7 +1472,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
if (!RT_NODE_SEARCH_INNER(node, key, &child))
return false;
- node = child;
+ node = RT_PTR_GET_LOCAL(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1341,37 +1487,40 @@ RT_SCOPE bool
RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
{
RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
level = -1;
while (shift > 0)
{
RT_PTR_ALLOC child;
/* Push the current node to the stack */
- stack[++level] = node;
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
if (!RT_NODE_SEARCH_INNER(node, key, &child))
return false;
- node = child;
+ allocnode = child;
shift -= RT_NODE_SPAN;
}
/* Delete the key from the leaf node if exists */
- Assert(NODE_IS_LEAF(node));
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
deleted = RT_NODE_DELETE_LEAF(node, key);
if (!deleted)
@@ -1381,7 +1530,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ tree->ctl->num_keys--;
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1391,13 +1540,14 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
return true;
/* Free the empty leaf node */
- RT_FREE_NODE(tree, node);
+ RT_FREE_NODE(tree, allocnode);
/* Delete the key in inner nodes recursively */
while (level >= 0)
{
- node = stack[level--];
+ allocnode = stack[level--];
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
deleted = RT_NODE_DELETE_INNER(node, key);
Assert(deleted);
@@ -1406,7 +1556,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
break;
/* The node became empty */
- RT_FREE_NODE(tree, node);
+ RT_FREE_NODE(tree, allocnode);
}
return true;
@@ -1478,6 +1628,7 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
{
MemoryContext old_ctx;
RT_ITER *iter;
+ RT_PTR_LOCAL root;
int top_level;
old_ctx = MemoryContextSwitchTo(tree->context);
@@ -1486,17 +1637,18 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- RT_UPDATE_ITER_STACK(iter, iter->tree->root, top_level);
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1511,7 +1663,7 @@ RT_SCOPE bool
RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
{
/* Empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return false;
for (;;)
@@ -1571,7 +1723,7 @@ RT_END_ITERATE(RT_ITER *iter)
RT_SCOPE uint64
RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
{
- return tree->num_keys;
+ return tree->ctl->num_keys;
}
/*
@@ -1580,13 +1732,18 @@ RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
RT_SCOPE uint64
RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
+ // XXX is this necessary?
Size total = sizeof(RT_RADIX_TREE);
+#ifdef RT_SHMEM
+ total = dsa_get_total_size(tree->dsa);
+#else
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
}
+#endif
return total;
}
@@ -1670,13 +1827,13 @@ void
rt_stats(RT_RADIX_TREE *tree)
{
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->ctl->num_keys,
+ tree->ctl->root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
}
static void
@@ -1848,23 +2005,23 @@ rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
- tree->max_val, tree->max_val);
+ tree->ctl->max_val, tree->ctl->max_val);
- if (!tree->root)
+ if (!tree->ctl->root)
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
{
elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
while (shift >= 0)
{
RT_PTR_LOCAL child;
@@ -1901,15 +2058,15 @@ rt_dump(RT_RADIX_TREE *tree)
RT_SIZE_CLASS_INFO[i].inner_blocksize,
RT_SIZE_CLASS_INFO[i].leaf_size,
RT_SIZE_CLASS_INFO[i].leaf_blocksize);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
- if (!tree->root)
+ if (!tree->ctl->root)
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ rt_dump_node(tree->ctl->root, 0, true);
}
#endif
@@ -1931,6 +2088,7 @@ rt_dump(RT_RADIX_TREE *tree)
/* type declarations */
#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
#undef RT_ITER
#undef RT_NODE
#undef RT_NODE_ITER
@@ -1959,6 +2117,7 @@ rt_dump(RT_RADIX_TREE *tree)
/* function declarations */
#undef RT_CREATE
#undef RT_FREE
+#undef RT_ATTACH
#undef RT_SET
#undef RT_BEGIN_ITERATE
#undef RT_ITERATE_NEXT
@@ -1980,6 +2139,8 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_GROW_NODE_KIND
#undef RT_COPY_NODE
#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
#undef RT_NODE_4_SEARCH_EQ
#undef RT_NODE_32_SEARCH_EQ
#undef RT_NODE_4_GET_INSERTPOS
@@ -2005,6 +2166,7 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_SHIFT_GET_MAX_VAL
#undef RT_NODE_SEARCH_INNER
#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
#undef RT_NODE_DELETE_INNER
#undef RT_NODE_DELETE_LEAF
#undef RT_NODE_INSERT_INNER
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index 6eefc63e19..eb87866b90 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -16,6 +16,12 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
switch (node->kind)
{
case RT_NODE_KIND_4:
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index ff76583402..e4faf54d9d 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -14,11 +14,14 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool chunk_exists = false;
- RT_NODE *newnode = NULL;
+ RT_PTR_LOCAL newnode = NULL;
+ RT_PTR_ALLOC allocnode;
#ifdef RT_NODE_LEVEL_LEAF
+ const bool inner = false;
Assert(NODE_IS_LEAF(node));
#else
+ const bool inner = true;
Assert(!NODE_IS_LEAF(node));
#endif
@@ -45,9 +48,15 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
/* grow node from 4 to 32 */
- newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
new32 = (RT_NODE32_TYPE *) newnode;
#ifdef RT_NODE_LEVEL_LEAF
RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
@@ -57,7 +66,7 @@
new32->base.chunks, new32->children);
#endif
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
}
else
@@ -112,17 +121,19 @@
n32->base.n.fanout == class32_min.fanout)
{
/* grow to the next size class of this kind */
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
#ifdef RT_NODE_LEVEL_LEAF
- newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, false);
memcpy(newnode, node, class32_min.leaf_size);
#else
- newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, true);
memcpy(newnode, node, class32_min.inner_size);
#endif
newnode->fanout = class32_max.fanout;
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
/* also update pointer for this kind */
@@ -132,11 +143,17 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
Assert(n32->base.n.fanout == class32_max.fanout);
/* grow node from 32 to 125 */
- newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
new125 = (RT_NODE125_TYPE *) newnode;
for (int i = 0; i < class32_max.fanout; i++)
@@ -153,7 +170,7 @@
new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
}
else
@@ -204,9 +221,15 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
/* grow node from 125 to 256 */
- newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
new256 = (RT_NODE256_TYPE *) newnode;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
@@ -221,7 +244,7 @@
}
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
}
else
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index a153011376..09d2018dc0 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -12,13 +12,18 @@
#error node level must be either inner or leaf
#endif
+ bool found = false;
+ uint8 key_chunk;
+
#ifdef RT_NODE_LEVEL_LEAF
uint64 value;
+
+ Assert(NODE_IS_LEAF(node_iter->node));
#else
- RT_NODE *child = NULL;
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!NODE_IS_LEAF(node_iter->node));
#endif
- bool found = false;
- uint8 key_chunk;
switch (node_iter->node->kind)
{
@@ -32,7 +37,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n4->values[node_iter->current_idx];
#else
- child = n4->children[node_iter->current_idx];
+ child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
#endif
key_chunk = n4->base.chunks[node_iter->current_idx];
found = true;
@@ -49,7 +54,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n32->values[node_iter->current_idx];
#else
- child = n32->children[node_iter->current_idx];
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
#endif
key_chunk = n32->base.chunks[node_iter->current_idx];
found = true;
@@ -73,7 +78,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
#else
- child = RT_NODE_INNER_125_GET_CHILD(n125, i);
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
#endif
key_chunk = i;
found = true;
@@ -101,7 +106,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
#else
- child = RT_NODE_INNER_256_GET_CHILD(n256, i);
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
#endif
key_chunk = i;
found = true;
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index cbc357dcc8..3e97c31c2c 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -16,8 +16,13 @@
#ifdef RT_NODE_LEVEL_LEAF
uint64 value = 0;
+
+ Assert(NODE_IS_LEAF(node));
#else
- RT_PTR_LOCAL child = NULL;
+#ifndef RT_ACTION_UPDATE
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+#endif
+ Assert(!NODE_IS_LEAF(node));
#endif
switch (node->kind)
@@ -32,8 +37,12 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n4->values[idx];
+#else
+#ifdef RT_ACTION_UPDATE
+ n4->children[idx] = new_child;
#else
child = n4->children[idx];
+#endif
#endif
break;
}
@@ -47,22 +56,31 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n32->values[idx];
+#else
+#ifdef RT_ACTION_UPDATE
+ n32->children[idx] = new_child;
#else
child = n32->children[idx];
+#endif
#endif
break;
}
case RT_NODE_KIND_125:
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
- if (!RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
+ if (slotpos == RT_NODE_125_INVALID_IDX)
return false;
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+#ifdef RT_ACTION_UPDATE
+ n125->children[slotpos] = new_child;
#else
child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
#endif
break;
}
@@ -79,19 +97,25 @@
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
#else
child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
#endif
break;
}
}
+#ifndef RT_ACTION_UPDATE
#ifdef RT_NODE_LEVEL_LEAF
Assert(value_p != NULL);
*value_p = value;
#else
Assert(child_p != NULL);
*child_p = child;
+#endif
#endif
return true;
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 104386e674..c67f936880 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 2256d08100..61d842789d 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -18,6 +18,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -103,6 +104,8 @@ static const test_spec test_specs[] = {
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
#include "lib/radixtree.h"
@@ -119,7 +122,15 @@ test_empty(void)
uint64 key;
uint64 val;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
radixtree = rt_create(CurrentMemoryContext);
+#endif
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -153,10 +164,20 @@ test_basic(int children, bool test_inner)
uint64 *keys;
int shift = test_inner ? 8 : 0;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
elog(NOTICE, "testing basic operations with %s node %d",
test_inner ? "inner" : "leaf", children);
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
radixtree = rt_create(CurrentMemoryContext);
+#endif
/* prepare keys in order like 1, 32, 2, 31, 2, ... */
keys = palloc(sizeof(uint64) * children);
@@ -297,9 +318,19 @@ test_node_types(uint8 shift)
{
rt_radix_tree *radixtree;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
radixtree = rt_create(CurrentMemoryContext);
+#endif
/*
* Insert and search entries for every node type at the 'shift' level,
@@ -332,6 +363,11 @@ test_pattern(const test_spec * spec)
int patternlen;
uint64 *pattern_values;
uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
if (rt_test_stats)
@@ -357,7 +393,13 @@ test_pattern(const test_spec * spec)
"radixtree test",
ALLOCSET_SMALL_SIZES);
MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
radixtree = rt_create(radixtree_ctx);
+#endif
+
/*
* Add values to the set.
@@ -563,6 +605,7 @@ test_pattern(const test_spec * spec)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+ rt_free(radixtree);
MemoryContextDelete(radixtree_ctx);
}
--
2.39.0
Attachment: v17-0008-Invent-specific-pointer-macros.patch (text/x-patch, charset=US-ASCII)
From 46ac0171f5a3bd80dfea8ad4061b1567650b8061 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 6 Jan 2023 14:20:51 +0700
Subject: [PATCH v17 8/9] Invent specific pointer macros
RT_PTR_LOCAL - a normal pointer to local memory
RT_PTR_ALLOC - the result of allocation, possibly a DSA pointer
RT_EXTEND and RT_SET_EXTEND have some code changes to show
how these are meant to be treated differently, but most of that
work is punted until later.
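To make the distinction concrete, here is a rough sketch, not part of the
patch, of how the two macros are expected to diverge once the tree can live
in a DSA area. The RT_SHMEM branch follows the DSA-support patch earlier in
this mail, which resolves allocated pointers with dsa_get_address(); the
exact definitions here are illustration only:

/*
 * Sketch only: local pointers vs. "allocated" pointers.  Under RT_SHMEM the
 * allocated form is a dsa_pointer that must be translated before use; in the
 * local-memory build the two are the same thing.
 */
#ifdef RT_SHMEM
#define RT_PTR_ALLOC dsa_pointer		/* relative pointer, valid in any backend */
#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
#define RT_PTR_GET_LOCAL(tree, ptr) \
	((RT_PTR_LOCAL) dsa_get_address((tree)->dsa, (ptr)))
#else
#define RT_PTR_ALLOC RT_PTR_LOCAL		/* plain pointer into backend-local memory */
#define RT_INVALID_PTR_ALLOC NULL
#define RT_PTR_GET_LOCAL(tree, ptr) (ptr)
#endif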
---
src/include/lib/radixtree.h | 165 +++++++++++++-----------
src/include/lib/radixtree_search_impl.h | 2 +-
2 files changed, 89 insertions(+), 78 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index e4350730b7..b3d84da033 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -301,8 +301,12 @@ typedef struct RT_NODE
uint8 kind;
} RT_NODE;
-#define NODE_IS_LEAF(n) (((RT_NODE *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((RT_NODE *) (n))->count == 0)
+#define RT_PTR_LOCAL RT_NODE *
+
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+
+#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
@@ -366,7 +370,7 @@ typedef struct RT_NODE_INNER_4
RT_NODE_BASE_4 base;
/* number of children depends on size class */
- RT_NODE *children[FLEXIBLE_ARRAY_MEMBER];
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_INNER_4;
typedef struct RT_NODE_LEAF_4
@@ -382,7 +386,7 @@ typedef struct RT_NODE_INNER_32
RT_NODE_BASE_32 base;
/* number of children depends on size class */
- RT_NODE *children[FLEXIBLE_ARRAY_MEMBER];
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_INNER_32;
typedef struct RT_NODE_LEAF_32
@@ -398,7 +402,7 @@ typedef struct RT_NODE_INNER_125
RT_NODE_BASE_125 base;
/* number of children depends on size class */
- RT_NODE *children[FLEXIBLE_ARRAY_MEMBER];
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_INNER_125;
typedef struct RT_NODE_LEAF_125
@@ -418,7 +422,7 @@ typedef struct RT_NODE_INNER_256
RT_NODE_BASE_256 base;
/* Slots for 256 children */
- RT_NODE *children[RT_NODE_MAX_SLOTS];
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
} RT_NODE_INNER_256;
typedef struct RT_NODE_LEAF_256
@@ -458,33 +462,33 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
[RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
- .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_NODE *),
+ .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_NODE *)),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
- .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_NODE *),
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_NODE *)),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
- .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_NODE *),
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_NODE *)),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
- .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_NODE *),
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_NODE *)),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
.leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
},
[RT_CLASS_256] = {
@@ -512,7 +516,7 @@ typedef struct RT_RADIX_TREE
{
MemoryContext context;
- RT_NODE *root;
+ RT_PTR_ALLOC root;
uint64 max_val;
uint64 num_keys;
@@ -541,7 +545,7 @@ typedef struct RT_RADIX_TREE
*/
typedef struct RT_NODE_ITER
{
- RT_NODE *node; /* current node being iterated */
+ RT_PTR_LOCAL node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
} RT_NODE_ITER;
@@ -558,13 +562,13 @@ typedef struct RT_ITER
} RT_ITER;
-static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node,
- uint64 key, RT_NODE *child);
-static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node,
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_LOCAL child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
uint64 key, uint64 value);
/* verification (available only with assertion) */
-static void RT_VERIFY_NODE(RT_NODE *node);
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
@@ -713,10 +717,10 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_NODE **children, int count, int idx)
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_NODE *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
}
static inline void
@@ -728,10 +732,10 @@ RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_NODE **children, int count, int idx)
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_NODE *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
}
static inline void
@@ -743,12 +747,12 @@ RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_NODE **src_children,
- uint8 *dst_chunks, RT_NODE **dst_children)
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
{
const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size children_size = sizeof(RT_NODE *) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_children, src_children, children_size);
@@ -775,7 +779,7 @@ RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
}
-static inline RT_NODE *
+static inline RT_PTR_ALLOC
RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
@@ -810,7 +814,7 @@ RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
}
-static inline RT_NODE *
+static inline RT_PTR_ALLOC
RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
@@ -828,7 +832,7 @@ RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
/* Set the child in the node-256 */
static inline void
-RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_NODE *child)
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
{
Assert(!NODE_IS_LEAF(node));
node->children[chunk] = child;
@@ -890,16 +894,16 @@ RT_SHIFT_GET_MAX_VAL(int shift)
/*
* Allocate a new node with the given node kind.
*/
-static RT_NODE *
+static RT_PTR_ALLOC
RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
{
- RT_NODE *newnode;
+ RT_PTR_ALLOC newnode;
if (inner)
- newnode = (RT_NODE *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
RT_SIZE_CLASS_INFO[size_class].inner_size);
else
- newnode = (RT_NODE *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
RT_SIZE_CLASS_INFO[size_class].leaf_size);
#ifdef RT_DEBUG
@@ -912,7 +916,7 @@ RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
/* Initialize the node contents */
static inline void
-RT_INIT_NODE(RT_NODE *node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
{
if (inner)
MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
@@ -947,7 +951,7 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
{
int shift = RT_KEY_GET_SHIFT(key);
bool inner = shift > 0;
- RT_NODE *newnode;
+ RT_PTR_ALLOC newnode;
newnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
@@ -957,7 +961,7 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
}
static inline void
-RT_COPY_NODE(RT_NODE *newnode, RT_NODE *oldnode)
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
{
newnode->shift = oldnode->shift;
newnode->chunk = oldnode->chunk;
@@ -969,9 +973,9 @@ RT_COPY_NODE(RT_NODE *newnode, RT_NODE *oldnode)
* count of 'node'.
*/
static RT_NODE*
-RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_NODE *node, uint8 new_kind)
+RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
{
- RT_NODE *newnode;
+ RT_PTR_ALLOC newnode;
bool inner = !NODE_IS_LEAF(node);
newnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
@@ -983,7 +987,7 @@ RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_NODE *node, uint8 new_kind)
/* Free the given node */
static void
-RT_FREE_NODE(RT_RADIX_TREE *tree, RT_NODE *node)
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
{
/* If we're deleting the root node, make the tree empty */
if (tree->root == node)
@@ -1019,8 +1023,8 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_NODE *node)
* Replace old_child with new_child, and free the old one.
*/
static void
-RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *old_child,
- RT_NODE *new_child, uint64 key)
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
{
Assert(old_child->chunk == new_child->chunk);
Assert(old_child->shift == new_child->shift);
@@ -1056,17 +1060,22 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- RT_NODE_INNER_4 *node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_4 *n4;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ node = (RT_PTR_LOCAL) allocnode;
+ RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->shift = shift;
+ node->count = 1;
- node = (RT_NODE_INNER_4 *) RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
- RT_INIT_NODE((RT_NODE *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node->base.n.shift = shift;
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
+ n4 = (RT_NODE_INNER_4 *) node;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
tree->root->chunk = 0;
- tree->root = (RT_NODE *) node;
+ tree->root = node;
shift += RT_NODE_SPAN;
}
@@ -1079,18 +1088,20 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_NODE *parent,
- RT_NODE *node)
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+ RT_PTR_LOCAL node)
{
int shift = node->shift;
while (shift >= RT_NODE_SPAN)
{
- RT_NODE *newchild;
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
- newchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newchild = (RT_PTR_LOCAL) allocchild;
RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newchild->shift = newshift;
newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
@@ -1112,7 +1123,7 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_NODE *parent,
* pointer is set to child_p.
*/
static inline bool
-RT_NODE_SEARCH_INNER(RT_NODE *node, uint64 key, RT_NODE **child_p)
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_search_impl.h"
@@ -1126,7 +1137,7 @@ RT_NODE_SEARCH_INNER(RT_NODE *node, uint64 key, RT_NODE **child_p)
* to the value is set to value_p.
*/
static inline bool
-RT_NODE_SEARCH_LEAF(RT_NODE *node, uint64 key, uint64 *value_p)
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_search_impl.h"
@@ -1139,7 +1150,7 @@ RT_NODE_SEARCH_LEAF(RT_NODE *node, uint64 key, uint64 *value_p)
* Delete the node and return true if the key is found, otherwise return false.
*/
static inline bool
-RT_NODE_DELETE_INNER(RT_NODE *node, uint64 key)
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_delete_impl.h"
@@ -1152,7 +1163,7 @@ RT_NODE_DELETE_INNER(RT_NODE *node, uint64 key)
* Delete the node and return true if the key is found, otherwise return false.
*/
static inline bool
-RT_NODE_DELETE_LEAF(RT_NODE *node, uint64 key)
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_delete_impl.h"
@@ -1161,8 +1172,8 @@ RT_NODE_DELETE_LEAF(RT_NODE *node, uint64 key)
/* Insert the child to the inner node */
static bool
-RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node, uint64 key,
- RT_NODE *child)
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node, uint64 key,
+ RT_PTR_ALLOC child)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_insert_impl.h"
@@ -1171,7 +1182,7 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node, uint64
/* Insert the value to the leaf node */
static bool
-RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node,
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
uint64 key, uint64 value)
{
#define RT_NODE_LEVEL_LEAF
@@ -1241,8 +1252,8 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- RT_NODE *node;
- RT_NODE *parent;
+ RT_PTR_LOCAL node;
+ RT_PTR_LOCAL parent;
/* Empty tree, create the root */
if (!tree->root)
@@ -1260,7 +1271,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- RT_NODE *child;
+ RT_PTR_LOCAL child;
if (NODE_IS_LEAF(node))
break;
@@ -1293,7 +1304,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
RT_SCOPE bool
RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
{
- RT_NODE *node;
+ RT_PTR_LOCAL node;
int shift;
Assert(value_p != NULL);
@@ -1307,7 +1318,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- RT_NODE *child;
+ RT_PTR_ALLOC child;
if (NODE_IS_LEAF(node))
break;
@@ -1329,8 +1340,8 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
RT_SCOPE bool
RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
{
- RT_NODE *node;
- RT_NODE *stack[RT_MAX_LEVEL] = {0};
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1347,7 +1358,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
level = -1;
while (shift > 0)
{
- RT_NODE *child;
+ RT_PTR_ALLOC child;
/* Push the current node to the stack */
stack[++level] = node;
@@ -1412,7 +1423,7 @@ RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
* Advance the slot in the inner node. Return the child if exists, otherwise
* null.
*/
-static inline RT_NODE *
+static inline RT_PTR_LOCAL
RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
{
#define RT_NODE_LEVEL_INNER
@@ -1437,10 +1448,10 @@ RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_NODE *from_node, int from)
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
{
int level = from;
- RT_NODE *node = from_node;
+ RT_PTR_LOCAL node = from_node;
for (;;)
{
@@ -1505,7 +1516,7 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- RT_NODE *child = NULL;
+ RT_PTR_LOCAL child = NULL;
uint64 value;
int level;
bool found;
@@ -1584,7 +1595,7 @@ RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
* Verify the radix tree node.
*/
static void
-RT_VERIFY_NODE(RT_NODE *node)
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
{
#ifdef USE_ASSERT_CHECKING
Assert(node->count >= 0);
@@ -1669,7 +1680,7 @@ rt_stats(RT_RADIX_TREE *tree)
}
static void
-rt_dump_node(RT_NODE *node, int level, bool recurse)
+rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
{
char space[125] = {0};
@@ -1831,7 +1842,7 @@ rt_dump_node(RT_NODE *node, int level, bool recurse)
void
rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
{
- RT_NODE *node;
+ RT_PTR_LOCAL node;
int shift;
int level = 0;
@@ -1856,7 +1867,7 @@ rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
shift = tree->root->shift;
while (shift >= 0)
{
- RT_NODE *child;
+ RT_PTR_LOCAL child;
rt_dump_node(node, level, false);
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 1a0d2d3f1f..cbc357dcc8 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -17,7 +17,7 @@
#ifdef RT_NODE_LEVEL_LEAF
uint64 value = 0;
#else
- RT_NODE *child = NULL;
+ RT_PTR_LOCAL child = NULL;
#endif
switch (node->kind)
--
2.39.0
Attachment: v17-0007-Convert-radixtree.h-into-a-template.patch (text/x-patch, charset=US-ASCII)
From b4857416c4030057a79cf52cdd7ffff88f55f73c Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 4 Jan 2023 14:43:17 +0700
Subject: [PATCH v17 7/9] Convert radixtree.h into a template
The only things configurable at this point are the function scope
and prefix, since the point is to see if this makes a shared-memory
implementation clear and maintainable.
The key and value types are still hard-coded to uint64.
To make this more useful, at least the value type should be
configurable.
It might be good at some point to offer a different tree type,
e.g. "single-value leaves", to allow for variable-length keys
and values, giving full flexibility to developers.
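For reference, a minimal usage sketch, not part of the patch, of how a
caller instantiates the template. It is modeled on the test_radixtree.c
changes above; the RT_PREFIX value and the example function are
illustrative assumptions:

/* Hypothetical instantiation of the radix tree template (sketch only). */
#define RT_PREFIX rt			/* generates rt_radix_tree, rt_create(), rt_set(), ... */
#define RT_SCOPE static
#define RT_DECLARE				/* emit type and function declarations */
#define RT_DEFINE				/* emit function definitions */
#include "lib/radixtree.h"		/* #undef's all RT_* parameters afterwards */

static void
radixtree_usage_example(void)
{
	rt_radix_tree *tree = rt_create(CurrentMemoryContext);
	uint64		value;

	rt_set(tree, 42, 100);		/* insert key 42 with value 100 */
	if (rt_search(tree, 42, &value))
		Assert(value == 100);
	rt_free(tree);
}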
---
src/include/lib/radixtree.h | 987 +++++++++++-------
src/include/lib/radixtree_delete_impl.h | 36 +-
src/include/lib/radixtree_insert_impl.h | 92 +-
src/include/lib/radixtree_iter_impl.h | 34 +-
src/include/lib/radixtree_search_impl.h | 36 +-
.../modules/test_radixtree/test_radixtree.c | 23 +-
6 files changed, 718 insertions(+), 490 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index fe517793f4..e4350730b7 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -29,24 +29,41 @@
*
* XXX: the radix tree node never be shrunk.
*
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ *
+ * Optional parameters:
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
* Interface
* ---------
*
- * rt_create - Create a new, empty radix tree
- * rt_free - Free the radix tree
- * rt_search - Search a key-value pair
- * rt_set - Set a key-value pair
- * rt_delete - Delete a key-value pair
- * rt_begin_iterate - Begin iterating through all key-value pairs
- * rt_iterate_next - Return next key-value pair, if any
- * rt_end_iter - End iteration
- * rt_memory_usage - Get the memory usage
- * rt_num_entries - Get the number of key-value pairs
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_DELETE - Delete a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITER - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ * RT_NUM_ENTRIES - Get the number of key-value pairs
*
- * rt_create() creates an empty radix tree in the given memory context
+ * RT_CREATE() creates an empty radix tree in the given memory context
* and memory contexts for all kinds of radix tree node under the memory context.
*
- * rt_iterate_next() ensures returning key-value pairs in the ascending
+ * RT_ITERATE_NEXT() ensures returning key-value pairs in the ascending
* order of the key.
*
* Copyright (c) 2022, PostgreSQL Global Development Group
@@ -66,6 +83,133 @@
#include "port/pg_lfind.h"
#include "utils/memutils.h"
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#define RT_DELETE RT_MAKE_NAME(delete)
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#define RT_NUM_ENTRIES RT_MAKE_NAME(num_entries)
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_ITER RT_MAKE_NAME(iter)
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
+#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
+#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+RT_SCOPE uint64 RT_NUM_ENTRIES(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* macros and types common to all implementations */
+#ifndef RT_COMMON
+#define RT_COMMON
+
#ifdef RT_DEBUG
#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
#endif
@@ -80,7 +224,7 @@
#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
/* Maximum shift the radix tree uses */
-#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
/* Tree level the radix tree uses */
#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
@@ -101,7 +245,7 @@
* There are 4 node kinds and each node kind have one or two size classes,
* partial and full. The size classes in the same node kind have the same
* node structure but have the different number of fanout that is stored
- * in 'fanout' of rt_node. For example in size class 15, when a 16th element
+ * in 'fanout' of RT_NODE. For example in size class 15, when a 16th element
* is to be inserted, we allocate a larger area and memcpy the entire old
* node to it.
*
@@ -119,19 +263,20 @@
#define RT_NODE_KIND_256 0x03
#define RT_NODE_KIND_COUNT 4
-typedef enum rt_size_class
+#endif /* RT_COMMON */
+
+
+typedef enum RT_SIZE_CLASS
{
RT_CLASS_4_FULL = 0,
RT_CLASS_32_PARTIAL,
RT_CLASS_32_FULL,
RT_CLASS_125_FULL,
RT_CLASS_256
-
-#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
-} rt_size_class;
+} RT_SIZE_CLASS;
/* Common type for all nodes types */
-typedef struct rt_node
+typedef struct RT_NODE
{
/*
* Number of children. We use uint16 to be able to indicate 256 children
@@ -154,53 +299,54 @@ typedef struct rt_node
/* Node kind, one per search/set algorithm */
uint8 kind;
-} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+} RT_NODE;
+
+#define NODE_IS_LEAF(n) (((RT_NODE *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((RT_NODE *) (n))->count == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
- ((node)->base.n.count < rt_size_class_info[class].fanout)
+ ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
/* Base type of each node kinds for leaf and inner nodes */
/* The base types must be a be able to accommodate the largest size
class for variable-sized node kinds*/
-typedef struct rt_node_base_4
+typedef struct RT_NODE_BASE_4
{
- rt_node n;
+ RT_NODE n;
/* 4 children, for key chunks */
uint8 chunks[4];
-} rt_node_base_4;
+} RT_NODE_BASE_4;
-typedef struct rt_node_base32
+typedef struct RT_NODE_BASE_32
{
- rt_node n;
+ RT_NODE n;
/* 32 children, for key chunks */
uint8 chunks[32];
-} rt_node_base_32;
+} RT_NODE_BASE_32;
/*
* node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
* 256, to store indexes into a second array that contains up to 125 values (or
* child pointers in inner nodes).
*/
-typedef struct rt_node_base125
+typedef struct RT_NODE_BASE_125
{
- rt_node n;
+ RT_NODE n;
/* The index of slots for each fanout */
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
/* isset is a bitmap to track which slot is in use */
bitmapword isset[BM_IDX(128)];
-} rt_node_base_125;
+} RT_NODE_BASE_125;
-typedef struct rt_node_base256
+typedef struct RT_NODE_BASE_256
{
- rt_node n;
-} rt_node_base_256;
+ RT_NODE n;
+} RT_NODE_BASE_256;
/*
* Inner and leaf nodes.
@@ -215,79 +361,79 @@ typedef struct rt_node_base256
* good. It might be better to just indicate non-existing entries the same way
* in inner nodes.
*/
-typedef struct rt_node_inner_4
+typedef struct RT_NODE_INNER_4
{
- rt_node_base_4 base;
+ RT_NODE_BASE_4 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_inner_4;
+ RT_NODE *children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_4;
-typedef struct rt_node_leaf_4
+typedef struct RT_NODE_LEAF_4
{
- rt_node_base_4 base;
+ RT_NODE_BASE_4 base;
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_leaf_4;
+} RT_NODE_LEAF_4;
-typedef struct rt_node_inner_32
+typedef struct RT_NODE_INNER_32
{
- rt_node_base_32 base;
+ RT_NODE_BASE_32 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_inner_32;
+ RT_NODE *children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
-typedef struct rt_node_leaf_32
+typedef struct RT_NODE_LEAF_32
{
- rt_node_base_32 base;
+ RT_NODE_BASE_32 base;
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_leaf_32;
+} RT_NODE_LEAF_32;
-typedef struct rt_node_inner_125
+typedef struct RT_NODE_INNER_125
{
- rt_node_base_125 base;
+ RT_NODE_BASE_125 base;
/* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_inner_125;
+ RT_NODE *children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
-typedef struct rt_node_leaf_125
+typedef struct RT_NODE_LEAF_125
{
- rt_node_base_125 base;
+ RT_NODE_BASE_125 base;
/* number of values depends on size class */
uint64 values[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_leaf_125;
+} RT_NODE_LEAF_125;
/*
* node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
* for directly storing values (or child pointers in inner nodes).
*/
-typedef struct rt_node_inner_256
+typedef struct RT_NODE_INNER_256
{
- rt_node_base_256 base;
+ RT_NODE_BASE_256 base;
/* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
-} rt_node_inner_256;
+ RT_NODE *children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
-typedef struct rt_node_leaf_256
+typedef struct RT_NODE_LEAF_256
{
- rt_node_base_256 base;
+ RT_NODE_BASE_256 base;
/* isset is a bitmap to track which slot is in use */
bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
/* Slots for 256 values */
uint64 values[RT_NODE_MAX_SLOTS];
-} rt_node_leaf_256;
+} RT_NODE_LEAF_256;
/* Information for each size class */
-typedef struct rt_size_class_elem
+typedef struct RT_SIZE_CLASS_ELEM
{
const char *name;
int fanout;
@@ -299,7 +445,7 @@ typedef struct rt_size_class_elem
/* slab block size */
Size inner_blocksize;
Size leaf_blocksize;
-} rt_size_class_elem;
+} RT_SIZE_CLASS_ELEM;
/*
* Calculate the slab blocksize so that we can allocate at least 32 chunks
@@ -307,51 +453,54 @@ typedef struct rt_size_class_elem
*/
#define NODE_SLAB_BLOCK_SIZE(size) \
Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
-static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
[RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
- .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_NODE *),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_NODE *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
- .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_NODE *),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_NODE *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
- .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_NODE *),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_NODE *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
- .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_NODE *),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_NODE *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
.fanout = 256,
- .inner_size = sizeof(rt_node_inner_256),
- .leaf_size = sizeof(rt_node_leaf_256),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
},
};
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
/* Map from the node kind to its minimum size class */
-static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
[RT_NODE_KIND_4] = RT_CLASS_4_FULL,
[RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
[RT_NODE_KIND_125] = RT_CLASS_125_FULL,
@@ -359,11 +508,11 @@ static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
};
/* A radix tree with nodes */
-typedef struct radix_tree
+typedef struct RT_RADIX_TREE
{
MemoryContext context;
- rt_node *root;
+ RT_NODE *root;
uint64 max_val;
uint64 num_keys;
@@ -374,7 +523,7 @@ typedef struct radix_tree
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
-} radix_tree;
+} RT_RADIX_TREE;
/*
* Iteration support.
@@ -382,79 +531,47 @@ typedef struct radix_tree
* Iterating the radix tree returns each pair of key and value in the ascending
* order of the key. To support this, the we iterate nodes of each level.
*
- * rt_node_iter struct is used to track the iteration within a node.
+ * RT_NODE_ITER struct is used to track the iteration within a node.
*
- * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
* in order to track the iteration of each level. During the iteration, we also
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
*/
-typedef struct rt_node_iter
+typedef struct RT_NODE_ITER
{
- rt_node *node; /* current node being iterated */
+ RT_NODE *node; /* current node being iterated */
int current_idx; /* current position. -1 for initial value */
-} rt_node_iter;
+} RT_NODE_ITER;
-typedef struct rt_iter
+typedef struct RT_ITER
{
- radix_tree *tree;
+ RT_RADIX_TREE *tree;
/* Track the iteration on nodes of each level */
- rt_node_iter stack[RT_MAX_LEVEL];
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
int stack_len;
/* The key is being constructed during the iteration */
uint64 key;
-} rt_iter;
-
-extern radix_tree *rt_create(MemoryContext ctx);
-extern void rt_free(radix_tree *tree);
-extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
-extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
-extern rt_iter *rt_begin_iterate(radix_tree *tree);
+} RT_ITER;
-extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
-extern void rt_end_iterate(rt_iter *iter);
-extern bool rt_delete(radix_tree *tree, uint64 key);
-extern uint64 rt_memory_usage(radix_tree *tree);
-extern uint64 rt_num_entries(radix_tree *tree);
-
-#ifdef RT_DEBUG
-extern void rt_dump(radix_tree *tree);
-extern void rt_dump_search(radix_tree *tree, uint64 key);
-extern void rt_stats(radix_tree *tree);
-#endif
-
-
-static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
-static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
- bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
-static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node,
+ uint64 key, RT_NODE *child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node,
uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
-static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
-static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
+static void RT_VERIFY_NODE(RT_NODE *node);
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
* if there is no such element.
*/
static inline int
-node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
{
int idx = -1;
@@ -474,7 +591,7 @@ node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
* Return index of the chunk to insert into chunks in the given node.
*/
static inline int
-node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
{
int idx;
@@ -492,7 +609,7 @@ node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
* if there is no such element.
*/
static inline int
-node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
{
int count = node->n.count;
#ifndef USE_NO_SIMD
@@ -541,7 +658,7 @@ node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
* Return index of the chunk to insert into chunks in the given node.
*/
static inline int
-node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
{
int count = node->n.count;
#ifndef USE_NO_SIMD
@@ -596,14 +713,14 @@ node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
/* Shift the elements right at 'idx' by one */
static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_NODE **children, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_NODE *) * (count - idx));
}
static inline void
-chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64 *) * (count - idx));
@@ -611,14 +728,14 @@ chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
/* Delete the element at 'idx' */
static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_NODE **children, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_NODE *) * (count - idx - 1));
}
static inline void
-chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
@@ -626,22 +743,22 @@ chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
/* Copy both chunks and children/values arrays */
static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children)
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_NODE **src_children,
+ uint8 *dst_chunks, RT_NODE **dst_children)
{
- const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size children_size = sizeof(rt_node *) * fanout;
+ const Size children_size = sizeof(RT_NODE *) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_children, src_children, children_size);
}
static inline void
-chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
uint8 *dst_chunks, uint64 *dst_values)
{
- const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size values_size = sizeof(uint64) * fanout;
@@ -653,23 +770,23 @@ chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
/* Does the given chunk in the node have the value? */
static inline bool
-node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
{
return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
}
-static inline rt_node *
-node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+static inline RT_NODE *
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline uint64
-node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
- Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -677,14 +794,14 @@ node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
/* Return true if the slot corresponding to the given chunk is in use */
static inline bool
-node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
return (node->children[chunk] != NULL);
}
static inline bool
-node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
{
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
@@ -693,25 +810,25 @@ node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
}
-static inline rt_node *
-node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+static inline RT_NODE *
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
- Assert(node_inner_256_is_chunk_used(node, chunk));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
return node->children[chunk];
}
static inline uint64
-node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
- Assert(node_leaf_256_is_chunk_used(node, chunk));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
return node->values[chunk];
}
/* Set the child in the node-256 */
static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_NODE *child)
{
Assert(!NODE_IS_LEAF(node));
node->children[chunk] = child;
@@ -719,7 +836,7 @@ node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
/* Set the value in the node-256 */
static inline void
-node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
{
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
@@ -731,14 +848,14 @@ node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
/* Set the slot at the given chunk position */
static inline void
-node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
node->children[chunk] = NULL;
}
static inline void
-node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
{
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
@@ -751,7 +868,7 @@ node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
* Return the shift that is satisfied to store the given key.
*/
static inline int
-key_get_shift(uint64 key)
+RT_KEY_GET_SHIFT(uint64 key)
{
return (key == 0)
? 0
@@ -762,7 +879,7 @@ key_get_shift(uint64 key)
* Return the max value stored in a node with the given shift.
*/
static uint64
-shift_get_max_val(int shift)
+RT_SHIFT_GET_MAX_VAL(int shift)
{
if (shift == RT_MAX_SHIFT)
return UINT64_MAX;
@@ -770,38 +887,20 @@ shift_get_max_val(int shift)
return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
}
-/*
- * Create a new node as the root. Subordinate nodes will be created during
- * the insertion.
- */
-static void
-rt_new_root(radix_tree *tree, uint64 key)
-{
- int shift = key_get_shift(key);
- bool inner = shift > 0;
- rt_node *newnode;
-
- newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
- rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newnode->shift = shift;
- tree->max_val = shift_get_max_val(shift);
- tree->root = newnode;
-}
-
/*
* Allocate a new node with the given node kind.
*/
-static rt_node *
-rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+static RT_NODE *
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
{
- rt_node *newnode;
+ RT_NODE *newnode;
if (inner)
- newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
+ newnode = (RT_NODE *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ RT_SIZE_CLASS_INFO[size_class].inner_size);
else
- newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
+ newnode = (RT_NODE *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ RT_SIZE_CLASS_INFO[size_class].leaf_size);
#ifdef RT_DEBUG
/* update the statistics */
@@ -813,20 +912,20 @@ rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
/* Initialize the node contents */
static inline void
-rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+RT_INIT_NODE(RT_NODE *node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
{
if (inner)
- MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
else
- MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
node->kind = kind;
- node->fanout = rt_size_class_info[size_class].fanout;
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_125)
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
}
@@ -839,8 +938,26 @@ rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
node->fanout = 0;
}
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool inner = shift > 0;
+ RT_NODE *newnode;
+
+ newnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->root = newnode;
+}
+
static inline void
-rt_copy_node(rt_node *newnode, rt_node *oldnode)
+RT_COPY_NODE(RT_NODE *newnode, RT_NODE *oldnode)
{
newnode->shift = oldnode->shift;
newnode->chunk = oldnode->chunk;
@@ -851,22 +968,22 @@ rt_copy_node(rt_node *newnode, rt_node *oldnode)
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
*/
-static rt_node*
-rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+static RT_NODE*
+RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_NODE *node, uint8 new_kind)
{
- rt_node *newnode;
+ RT_NODE *newnode;
bool inner = !NODE_IS_LEAF(node);
- newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
- rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
- rt_copy_node(newnode, node);
+ newnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_COPY_NODE(newnode, node);
return newnode;
}
/* Free the given node */
static void
-rt_free_node(radix_tree *tree, rt_node *node)
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_NODE *node)
{
/* If we're deleting the root node, make the tree empty */
if (tree->root == node)
@@ -882,7 +999,7 @@ rt_free_node(radix_tree *tree, rt_node *node)
/* update the statistics */
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
- if (node->fanout == rt_size_class_info[i].fanout)
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
break;
}
@@ -902,8 +1019,8 @@ rt_free_node(radix_tree *tree, rt_node *node)
* Replace old_child with new_child, and free the old one.
*/
static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *old_child,
+ RT_NODE *new_child, uint64 key)
{
Assert(old_child->chunk == new_child->chunk);
Assert(old_child->shift == new_child->shift);
@@ -917,11 +1034,11 @@ rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
{
bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ replaced = RT_NODE_INSERT_INNER(tree, NULL, parent, key, new_child);
Assert(replaced);
}
- rt_free_node(tree, old_child);
+ RT_FREE_NODE(tree, old_child);
}
/*
@@ -929,32 +1046,32 @@ rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
* store the key.
*/
static void
-rt_extend(radix_tree *tree, uint64 key)
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
int shift = tree->root->shift + RT_NODE_SPAN;
- target_shift = key_get_shift(key);
+ target_shift = RT_KEY_GET_SHIFT(key);
/* Grow tree from 'shift' to 'target_shift' */
while (shift <= target_shift)
{
- rt_node_inner_4 *node;
+ RT_NODE_INNER_4 *node;
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
- rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node = (RT_NODE_INNER_4 *) RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ RT_INIT_NODE((RT_NODE *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
node->base.n.shift = shift;
node->base.n.count = 1;
node->base.chunks[0] = 0;
node->children[0] = tree->root;
tree->root->chunk = 0;
- tree->root = (rt_node *) node;
+ tree->root = (RT_NODE *) node;
shift += RT_NODE_SPAN;
}
- tree->max_val = shift_get_max_val(target_shift);
+ tree->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
}
/*
@@ -962,29 +1079,29 @@ rt_extend(radix_tree *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_NODE *parent,
+ RT_NODE *node)
{
int shift = node->shift;
while (shift >= RT_NODE_SPAN)
{
- rt_node *newchild;
+ RT_NODE *newchild;
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
- newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
- rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newchild->shift = newshift;
newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
- rt_node_insert_inner(tree, parent, node, key, newchild);
+ RT_NODE_INSERT_INNER(tree, parent, node, key, newchild);
parent = node;
node = newchild;
shift -= RT_NODE_SPAN;
}
- rt_node_insert_leaf(tree, parent, node, key, value);
+ RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
tree->num_keys++;
}
@@ -995,7 +1112,7 @@ rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
* pointer is set to child_p.
*/
static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p)
+RT_NODE_SEARCH_INNER(RT_NODE *node, uint64 key, RT_NODE **child_p)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_search_impl.h"
@@ -1009,7 +1126,7 @@ rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p)
* to the value is set to value_p.
*/
static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p)
+RT_NODE_SEARCH_LEAF(RT_NODE *node, uint64 key, uint64 *value_p)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_search_impl.h"
@@ -1022,7 +1139,7 @@ rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p)
* Delete the node and return true if the key is found, otherwise return false.
*/
static inline bool
-rt_node_delete_inner(rt_node *node, uint64 key)
+RT_NODE_DELETE_INNER(RT_NODE *node, uint64 key)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_delete_impl.h"
@@ -1035,7 +1152,7 @@ rt_node_delete_inner(rt_node *node, uint64 key)
* Delete the node and return true if the key is found, otherwise return false.
*/
static inline bool
-rt_node_delete_leaf(rt_node *node, uint64 key)
+RT_NODE_DELETE_LEAF(RT_NODE *node, uint64 key)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_delete_impl.h"
@@ -1044,8 +1161,8 @@ rt_node_delete_leaf(rt_node *node, uint64 key)
/* Insert the child to the inner node */
static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node, uint64 key,
+ RT_NODE *child)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_insert_impl.h"
@@ -1054,7 +1171,7 @@ rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 ke
/* Insert the value to the leaf node */
static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_NODE *parent, RT_NODE *node,
uint64 key, uint64 value)
{
#define RT_NODE_LEVEL_LEAF
@@ -1065,15 +1182,15 @@ rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
/*
* Create the radix tree in the given memory context and return it.
*/
-radix_tree *
-rt_create(MemoryContext ctx)
+RT_SCOPE RT_RADIX_TREE *
+RT_CREATE(MemoryContext ctx)
{
- radix_tree *tree;
+ RT_RADIX_TREE *tree;
MemoryContext old_ctx;
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(radix_tree));
+ tree = palloc(sizeof(RT_RADIX_TREE));
tree->context = ctx;
tree->root = NULL;
tree->max_val = 0;
@@ -1083,13 +1200,13 @@ rt_create(MemoryContext ctx)
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].inner_size);
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].inner_size);
tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
- rt_size_class_info[i].leaf_size);
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size);
#ifdef RT_DEBUG
tree->cnt[i] = 0;
#endif
@@ -1103,8 +1220,8 @@ rt_create(MemoryContext ctx)
/*
* Free the given radix tree.
*/
-void
-rt_free(radix_tree *tree)
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
{
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
@@ -1119,21 +1236,21 @@ rt_free(radix_tree *tree)
* Set key to value. If the entry already exists, we update its value to 'value'
* and return true. Returns false if entry doesn't yet exist.
*/
-bool
-rt_set(radix_tree *tree, uint64 key, uint64 value)
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- rt_node *node;
- rt_node *parent;
+ RT_NODE *node;
+ RT_NODE *parent;
/* Empty tree, create the root */
if (!tree->root)
- rt_new_root(tree, key);
+ RT_NEW_ROOT(tree, key);
/* Extend the tree if necessary */
if (key > tree->max_val)
- rt_extend(tree, key);
+ RT_EXTEND(tree, key);
Assert(tree->root);
@@ -1143,14 +1260,14 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ RT_NODE *child;
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, &child))
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
{
- rt_set_extend(tree, key, value, parent, node);
+ RT_SET_EXTEND(tree, key, value, parent, node);
return false;
}
@@ -1159,7 +1276,7 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
shift -= RT_NODE_SPAN;
}
- updated = rt_node_insert_leaf(tree, parent, node, key, value);
+ updated = RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
/* Update the statistics */
if (!updated)
@@ -1173,10 +1290,10 @@ rt_set(radix_tree *tree, uint64 key, uint64 value)
* otherwise return false. On success, we set the value to *val_p so it must
* not be NULL.
*/
-bool
-rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
{
- rt_node *node;
+ RT_NODE *node;
int shift;
Assert(value_p != NULL);
@@ -1190,30 +1307,30 @@ rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- rt_node *child;
+ RT_NODE *child;
if (NODE_IS_LEAF(node))
break;
- if (!rt_node_search_inner(node, key, &child))
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
return false;
node = child;
shift -= RT_NODE_SPAN;
}
- return rt_node_search_leaf(node, key, value_p);
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
}
/*
* Delete the given key from the radix tree. Return true if the key is found (and
* deleted), otherwise do nothing and return false.
*/
-bool
-rt_delete(radix_tree *tree, uint64 key)
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
+ RT_NODE *node;
+ RT_NODE *stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
@@ -1230,12 +1347,12 @@ rt_delete(radix_tree *tree, uint64 key)
level = -1;
while (shift > 0)
{
- rt_node *child;
+ RT_NODE *child;
/* Push the current node to the stack */
stack[++level] = node;
- if (!rt_node_search_inner(node, key, &child))
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
return false;
node = child;
@@ -1244,7 +1361,7 @@ rt_delete(radix_tree *tree, uint64 key)
/* Delete the key from the leaf node if exists */
Assert(NODE_IS_LEAF(node));
- deleted = rt_node_delete_leaf(node, key);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
if (!deleted)
{
@@ -1263,14 +1380,14 @@ rt_delete(radix_tree *tree, uint64 key)
return true;
/* Free the empty leaf node */
- rt_free_node(tree, node);
+ RT_FREE_NODE(tree, node);
/* Delete the key in inner nodes recursively */
while (level >= 0)
{
node = stack[level--];
- deleted = rt_node_delete_inner(node, key);
+ deleted = RT_NODE_DELETE_INNER(node, key);
Assert(deleted);
/* If the node didn't become empty, we stop deleting the key */
@@ -1278,55 +1395,56 @@ rt_delete(radix_tree *tree, uint64 key)
break;
/* The node became empty */
- rt_free_node(tree, node);
+ RT_FREE_NODE(tree, node);
}
return true;
}
-/* Create and return the iterator for the given radix tree */
-rt_iter *
-rt_begin_iterate(radix_tree *tree)
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
{
- MemoryContext old_ctx;
- rt_iter *iter;
- int top_level;
-
- old_ctx = MemoryContextSwitchTo(tree->context);
-
- iter = (rt_iter *) palloc0(sizeof(rt_iter));
- iter->tree = tree;
-
- /* empty tree */
- if (!iter->tree->root)
- return iter;
-
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
- iter->stack_len = top_level;
-
- /*
- * Descend to the left most leaf node from the root. The key is being
- * constructed while descending to the leaf.
- */
- rt_update_iter_stack(iter, iter->tree->root, top_level);
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
- MemoryContextSwitchTo(old_ctx);
+/*
+ * Advance the slot in the inner node. Return the child if exists, otherwise
+ * null.
+ */
+static inline RT_NODE *
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
- return iter;
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to value_p, otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
}
/*
* Update each node_iter for inner nodes in the iterator node stack.
*/
static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_NODE *from_node, int from)
{
int level = from;
- rt_node *node = from_node;
+ RT_NODE *node = from_node;
for (;;)
{
- rt_node_iter *node_iter = &(iter->stack[level--]);
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
node_iter->node = node;
node_iter->current_idx = -1;
@@ -1336,19 +1454,50 @@ rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
return;
/* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
/* We must find the first children in the node */
Assert(node);
}
}
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the left most leaf node from the root. The key is being
+ * constructed while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
/*
* Return true with setting key_p and value_p if there is next key. Otherwise,
* return false.
*/
-bool
-rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
{
/* Empty tree */
if (!iter->tree->root)
@@ -1356,13 +1505,13 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
- rt_node *child = NULL;
+ RT_NODE *child = NULL;
uint64 value;
int level;
bool found;
/* Advance the leaf node iterator to get next key-value pair */
- found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
if (found)
{
@@ -1377,7 +1526,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
*/
for (level = 1; level <= iter->stack_len; level++)
{
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
if (child)
break;
@@ -1391,7 +1540,7 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
* Set the node to the node iterator and update the iterator stack
* from this node.
*/
- rt_update_iter_stack(iter, child, level - 1);
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
/* Node iterators are updated, so try again from the leaf */
}
@@ -1399,49 +1548,17 @@ rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
return false;
}
-void
-rt_end_iterate(rt_iter *iter)
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
{
pfree(iter);
}
-static inline void
-rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
-{
- iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
- iter->key |= (((uint64) chunk) << shift);
-}
-
-/*
- * Advance the slot in the inner node. Return the child if exists, otherwise
- * null.
- */
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
-{
-#define RT_NODE_LEVEL_INNER
-#include "lib/radixtree_iter_impl.h"
-#undef RT_NODE_LEVEL_INNER
-}
-
-/*
- * Advance the slot in the leaf node. On success, return true and the value
- * is set to value_p, otherwise return false.
- */
-static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
-{
-#define RT_NODE_LEVEL_LEAF
-#include "lib/radixtree_iter_impl.h"
-#undef RT_NODE_LEVEL_LEAF
-}
-
/*
* Return the number of keys in the radix tree.
*/
-uint64
-rt_num_entries(radix_tree *tree)
+RT_SCOPE uint64
+RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
{
return tree->num_keys;
}
@@ -1449,10 +1566,10 @@ rt_num_entries(radix_tree *tree)
/*
* Return the statistics of the amount of memory used by the radix tree.
*/
-uint64
-rt_memory_usage(radix_tree *tree)
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
- Size total = sizeof(radix_tree);
+ Size total = sizeof(RT_RADIX_TREE);
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
@@ -1467,7 +1584,7 @@ rt_memory_usage(radix_tree *tree)
* Verify the radix tree node.
*/
static void
-rt_verify_node(rt_node *node)
+RT_VERIFY_NODE(RT_NODE *node)
{
#ifdef USE_ASSERT_CHECKING
Assert(node->count >= 0);
@@ -1476,7 +1593,7 @@ rt_verify_node(rt_node *node)
{
case RT_NODE_KIND_4:
{
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+ RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
for (int i = 1; i < n4->n.count; i++)
Assert(n4->chunks[i - 1] < n4->chunks[i]);
@@ -1485,7 +1602,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_32:
{
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
for (int i = 1; i < n32->n.count; i++)
Assert(n32->chunks[i - 1] < n32->chunks[i]);
@@ -1494,7 +1611,7 @@ rt_verify_node(rt_node *node)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
int cnt = 0;
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
@@ -1503,7 +1620,7 @@ rt_verify_node(rt_node *node)
int idx = BM_IDX(slot);
int bitnum = BM_BIT(slot);
- if (!node_125_is_chunk_used(n125, i))
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
continue;
/* Check if the corresponding slot is used */
@@ -1520,7 +1637,7 @@ rt_verify_node(rt_node *node)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
int cnt = 0;
for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
@@ -1539,7 +1656,7 @@ rt_verify_node(rt_node *node)
/***************** DEBUG FUNCTIONS *****************/
#ifdef RT_DEBUG
void
-rt_stats(radix_tree *tree)
+rt_stats(RT_RADIX_TREE *tree)
{
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
tree->num_keys,
@@ -1552,7 +1669,7 @@ rt_stats(radix_tree *tree)
}
static void
-rt_dump_node(rt_node *node, int level, bool recurse)
+rt_dump_node(RT_NODE *node, int level, bool recurse)
{
char space[125] = {0};
@@ -1575,14 +1692,14 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n4->base.chunks[i], n4->values[i]);
}
else
{
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
fprintf(stderr, "%schunk 0x%X ->",
space, n4->base.chunks[i]);
@@ -1601,14 +1718,14 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
space, n32->base.chunks[i], n32->values[i]);
}
else
{
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
fprintf(stderr, "%schunk 0x%X ->",
space, n32->base.chunks[i]);
@@ -1625,19 +1742,19 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
case RT_NODE_KIND_125:
{
- rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
fprintf(stderr, "slot_idxs ");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
- if (!node_125_is_chunk_used(b125, i))
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
continue;
fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
}
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+ RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < BM_IDX(128); i++)
@@ -1649,25 +1766,25 @@ rt_dump_node(rt_node *node, int level, bool recurse)
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
- if (!node_125_is_chunk_used(b125, i))
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
continue;
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+ RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, node_leaf_125_get_value(n125, i));
+ space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
}
else
{
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
fprintf(stderr, "%schunk 0x%X ->",
space, i);
if (recurse)
- rt_dump_node(node_inner_125_get_child(n125, i),
+ rt_dump_node(RT_NODE_INNER_125_GET_CHILD(n125, i),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -1681,26 +1798,26 @@ rt_dump_node(rt_node *node, int level, bool recurse)
{
if (NODE_IS_LEAF(node))
{
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
- if (!node_leaf_256_is_chunk_used(n256, i))
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
continue;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, node_leaf_256_get_value(n256, i));
+ space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
}
else
{
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
- if (!node_inner_256_is_chunk_used(n256, i))
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
continue;
fprintf(stderr, "%schunk 0x%X ->",
space, i);
if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ rt_dump_node(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
recurse);
else
fprintf(stderr, "\n");
@@ -1712,9 +1829,9 @@ rt_dump_node(rt_node *node, int level, bool recurse)
}
void
-rt_dump_search(radix_tree *tree, uint64 key)
+rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
{
- rt_node *node;
+ RT_NODE *node;
int shift;
int level = 0;
@@ -1739,7 +1856,7 @@ rt_dump_search(radix_tree *tree, uint64 key)
shift = tree->root->shift;
while (shift >= 0)
{
- rt_node *child;
+ RT_NODE *child;
rt_dump_node(node, level, false);
@@ -1748,12 +1865,12 @@ rt_dump_search(radix_tree *tree, uint64 key)
uint64 dummy;
/* We reached at a leaf node, find the corresponding slot */
- rt_node_search_leaf(node, key, &dummy);
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
break;
}
- if (!rt_node_search_inner(node, key, &child))
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
break;
node = child;
@@ -1763,16 +1880,16 @@ rt_dump_search(radix_tree *tree, uint64 key)
}
void
-rt_dump(radix_tree *tree)
+rt_dump(RT_RADIX_TREE *tree)
{
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_size,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].leaf_size,
- rt_size_class_info[i].leaf_blocksize);
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_size,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
if (!tree->root)
@@ -1784,3 +1901,107 @@ rt_dump(radix_tree *tree)
rt_dump_node(tree->root, 0, true);
}
#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+
+/* locally declared macros */
+#undef NODE_IS_LEAF
+#undef NODE_IS_EMPTY
+#undef VAR_NODE_HAS_FREE_SLOT
+#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_SIZE_CLASS_COUNT
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_4_FULL
+#undef RT_CLASS_32_PARTIAL
+#undef RT_CLASS_32_FULL
+#undef RT_CLASS_125_FULL
+#undef RT_CLASS_256
+#undef RT_KIND_MIN_SIZE_CLASS
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_NUM_ENTRIES
+#undef RT_DUMP
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_GROW_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index 24fd9cc02b..6eefc63e19 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -1,15 +1,15 @@
/* TODO: shrink nodes */
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE rt_node_inner_4
-#define RT_NODE32_TYPE rt_node_inner_32
-#define RT_NODE125_TYPE rt_node_inner_125
-#define RT_NODE256_TYPE rt_node_inner_256
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE rt_node_leaf_4
-#define RT_NODE32_TYPE rt_node_leaf_32
-#define RT_NODE125_TYPE rt_node_leaf_125
-#define RT_NODE256_TYPE rt_node_leaf_256
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
#else
#error node level must be either inner or leaf
#endif
@@ -21,16 +21,16 @@
case RT_NODE_KIND_4:
{
RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
- int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
- chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
n4->base.n.count, idx);
#else
- chunk_children_array_delete(n4->base.chunks, n4->children,
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
n4->base.n.count, idx);
#endif
break;
@@ -38,16 +38,16 @@
case RT_NODE_KIND_32:
{
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
- int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
- chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
n32->base.n.count, idx);
#else
- chunk_children_array_delete(n32->base.chunks, n32->children,
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
n32->base.n.count, idx);
#endif
break;
@@ -74,16 +74,16 @@
RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
#ifdef RT_NODE_LEVEL_LEAF
- if (!node_leaf_256_is_chunk_used(n256, chunk))
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
#else
- if (!node_inner_256_is_chunk_used(n256, chunk))
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
#endif
return false;
#ifdef RT_NODE_LEVEL_LEAF
- node_leaf_256_delete(n256, chunk);
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
#else
- node_inner_256_delete(n256, chunk);
+ RT_NODE_INNER_256_DELETE(n256, chunk);
#endif
break;
}
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index c63fe9a3c0..ff76583402 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -1,20 +1,20 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE rt_node_inner_4
-#define RT_NODE32_TYPE rt_node_inner_32
-#define RT_NODE125_TYPE rt_node_inner_125
-#define RT_NODE256_TYPE rt_node_inner_256
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE rt_node_leaf_4
-#define RT_NODE32_TYPE rt_node_leaf_32
-#define RT_NODE125_TYPE rt_node_leaf_125
-#define RT_NODE256_TYPE rt_node_leaf_256
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
#else
#error node level must be either inner or leaf
#endif
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool chunk_exists = false;
- rt_node *newnode = NULL;
+ RT_NODE *newnode = NULL;
#ifdef RT_NODE_LEVEL_LEAF
Assert(NODE_IS_LEAF(node));
@@ -29,7 +29,7 @@
RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
int idx;
- idx = node_4_search_eq(&n4->base, chunk);
+ idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
if (idx != -1)
{
/* found the existing chunk */
@@ -47,22 +47,22 @@
RT_NODE32_TYPE *new32;
/* grow node from 4 to 32 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
new32 = (RT_NODE32_TYPE *) newnode;
#ifdef RT_NODE_LEVEL_LEAF
- chunk_values_array_copy(n4->base.chunks, n4->values,
+ RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
new32->base.chunks, new32->values);
#else
- chunk_children_array_copy(n4->base.chunks, n4->children,
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
#endif
Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
node = newnode;
}
else
{
- int insertpos = node_4_get_insertpos(&n4->base, chunk);
+ int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
int count = n4->base.n.count;
/* shift chunks and children */
@@ -70,10 +70,10 @@
{
Assert(count > 0);
#ifdef RT_NODE_LEVEL_LEAF
- chunk_values_array_shift(n4->base.chunks, n4->values,
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
count, insertpos);
#else
- chunk_children_array_shift(n4->base.chunks, n4->children,
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
count, insertpos);
#endif
}
@@ -90,12 +90,12 @@
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- const rt_size_class_elem minclass = rt_size_class_info[RT_CLASS_32_PARTIAL];
- const rt_size_class_elem maxclass = rt_size_class_info[RT_CLASS_32_FULL];
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
int idx;
- idx = node_32_search_eq(&n32->base, chunk);
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
if (idx != -1)
{
/* found the existing chunk */
@@ -109,20 +109,20 @@
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
- n32->base.n.fanout == minclass.fanout)
+ n32->base.n.fanout == class32_min.fanout)
{
/* grow to the next size class of this kind */
#ifdef RT_NODE_LEVEL_LEAF
- newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, false);
- memcpy(newnode, node, minclass.leaf_size);
+ newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, false);
+ memcpy(newnode, node, class32_min.leaf_size);
#else
- newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
- memcpy(newnode, node, minclass.inner_size);
+ newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, true);
+ memcpy(newnode, node, class32_min.inner_size);
#endif
- newnode->fanout = maxclass.fanout;
+ newnode->fanout = class32_max.fanout;
Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
node = newnode;
/* also update pointer for this kind */
@@ -133,13 +133,13 @@
{
RT_NODE125_TYPE *new125;
- Assert(n32->base.n.fanout == maxclass.fanout);
+ Assert(n32->base.n.fanout == class32_max.fanout);
/* grow node from 32 to 125 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
new125 = (RT_NODE125_TYPE *) newnode;
- for (int i = 0; i < maxclass.fanout; i++)
+ for (int i = 0; i < class32_max.fanout; i++)
{
new125->base.slot_idxs[n32->base.chunks[i]] = i;
#ifdef RT_NODE_LEVEL_LEAF
@@ -149,26 +149,26 @@
#endif
}
- Assert(maxclass.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
- new125->base.isset[0] = (bitmapword) (((uint64) 1 << maxclass.fanout) - 1);
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
node = newnode;
}
else
{
- int insertpos = node_32_get_insertpos(&n32->base, chunk);
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
int count = n32->base.n.count;
if (insertpos < count)
{
Assert(count > 0);
#ifdef RT_NODE_LEVEL_LEAF
- chunk_values_array_shift(n32->base.chunks, n32->values,
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
count, insertpos);
#else
- chunk_children_array_shift(n32->base.chunks, n32->children,
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
count, insertpos);
#endif
}
@@ -206,22 +206,22 @@
RT_NODE256_TYPE *new256;
/* grow node from 125 to 256 */
- newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
new256 = (RT_NODE256_TYPE *) newnode;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
- if (!node_125_is_chunk_used(&n125->base, i))
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
continue;
#ifdef RT_NODE_LEVEL_LEAF
- node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
#else
- node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
#endif
cnt++;
}
Assert(parent != NULL);
- rt_replace_node(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
node = newnode;
}
else
@@ -260,16 +260,16 @@
RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
#ifdef RT_NODE_LEVEL_LEAF
- chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
#else
- chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
#endif
Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
#ifdef RT_NODE_LEVEL_LEAF
- node_leaf_256_set(n256, chunk, value);
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
#else
- node_inner_256_set(n256, chunk, child);
+ RT_NODE_INNER_256_SET(n256, chunk, child);
#endif
break;
}
@@ -283,7 +283,7 @@
* Done. Finally, verify the chunk and value is inserted or replaced
* properly in the node.
*/
- rt_verify_node(node);
+ RT_VERIFY_NODE(node);
return chunk_exists;
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index bebf8e725a..a153011376 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -1,13 +1,13 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE rt_node_inner_4
-#define RT_NODE32_TYPE rt_node_inner_32
-#define RT_NODE125_TYPE rt_node_inner_125
-#define RT_NODE256_TYPE rt_node_inner_256
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE rt_node_leaf_4
-#define RT_NODE32_TYPE rt_node_leaf_32
-#define RT_NODE125_TYPE rt_node_leaf_125
-#define RT_NODE256_TYPE rt_node_leaf_256
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
#else
#error node level must be either inner or leaf
#endif
@@ -15,7 +15,7 @@
#ifdef RT_NODE_LEVEL_LEAF
uint64 value;
#else
- rt_node *child = NULL;
+ RT_NODE *child = NULL;
#endif
bool found = false;
uint8 key_chunk;
@@ -62,7 +62,7 @@
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
{
- if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
break;
}
@@ -71,9 +71,9 @@
node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = node_leaf_125_get_value(n125, i);
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
#else
- child = node_inner_125_get_child(n125, i);
+ child = RT_NODE_INNER_125_GET_CHILD(n125, i);
#endif
key_chunk = i;
found = true;
@@ -87,9 +87,9 @@
for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
{
#ifdef RT_NODE_LEVEL_LEAF
- if (node_leaf_256_is_chunk_used(n256, i))
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
#else
- if (node_inner_256_is_chunk_used(n256, i))
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
#endif
break;
}
@@ -99,9 +99,9 @@
node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = node_leaf_256_get_value(n256, i);
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
#else
- child = node_inner_256_get_child(n256, i);
+ child = RT_NODE_INNER_256_GET_CHILD(n256, i);
#endif
key_chunk = i;
found = true;
@@ -111,7 +111,7 @@
if (found)
{
- rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
#ifdef RT_NODE_LEVEL_LEAF
*value_p = value;
#endif
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index d0366f9bb6..1a0d2d3f1f 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -1,13 +1,13 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE rt_node_inner_4
-#define RT_NODE32_TYPE rt_node_inner_32
-#define RT_NODE125_TYPE rt_node_inner_125
-#define RT_NODE256_TYPE rt_node_inner_256
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE rt_node_leaf_4
-#define RT_NODE32_TYPE rt_node_leaf_32
-#define RT_NODE125_TYPE rt_node_leaf_125
-#define RT_NODE256_TYPE rt_node_leaf_256
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
#else
#error node level must be either inner or leaf
#endif
@@ -17,7 +17,7 @@
#ifdef RT_NODE_LEVEL_LEAF
uint64 value = 0;
#else
- rt_node *child = NULL;
+ RT_NODE *child = NULL;
#endif
switch (node->kind)
@@ -25,7 +25,7 @@
case RT_NODE_KIND_4:
{
RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
- int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
if (idx < 0)
return false;
@@ -40,7 +40,7 @@
case RT_NODE_KIND_32:
{
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
- int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
if (idx < 0)
return false;
@@ -56,13 +56,13 @@
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
- if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ if (!RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = node_leaf_125_get_value(n125, chunk);
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
#else
- child = node_inner_125_get_child(n125, chunk);
+ child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
#endif
break;
}
@@ -71,16 +71,16 @@
RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
#ifdef RT_NODE_LEVEL_LEAF
- if (!node_leaf_256_is_chunk_used(n256, chunk))
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
#else
- if (!node_inner_256_is_chunk_used(n256, chunk))
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
#endif
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = node_leaf_256_get_value(n256, chunk);
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
#else
- child = node_inner_256_get_child(n256, chunk);
+ child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
#endif
break;
}
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index ea993e63df..2256d08100 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -14,7 +14,6 @@
#include "common/pg_prng.h"
#include "fmgr.h"
-#include "lib/radixtree.h"
#include "miscadmin.h"
#include "nodes/bitmapset.h"
#include "storage/block.h"
@@ -99,6 +98,14 @@ static const test_spec test_specs[] = {
}
};
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(test_radixtree);
@@ -106,7 +113,7 @@ PG_FUNCTION_INFO_V1(test_radixtree);
static void
test_empty(void)
{
- radix_tree *radixtree;
+ rt_radix_tree *radixtree;
rt_iter *iter;
uint64 dummy;
uint64 key;
@@ -142,7 +149,7 @@ test_empty(void)
static void
test_basic(int children, bool test_inner)
{
- radix_tree *radixtree;
+ rt_radix_tree *radixtree;
uint64 *keys;
int shift = test_inner ? 8 : 0;
@@ -192,7 +199,7 @@ test_basic(int children, bool test_inner)
* Check if keys from start to end with the shift exist in the tree.
*/
static void
-check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
int incr)
{
for (int i = start; i < end; i++)
@@ -210,7 +217,7 @@ check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
}
static void
-test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
{
uint64 num_entries;
int ninserted = 0;
@@ -257,7 +264,7 @@ test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
}
static void
-test_node_types_delete(radix_tree *radixtree, uint8 shift)
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
{
uint64 num_entries;
@@ -288,7 +295,7 @@ test_node_types_delete(radix_tree *radixtree, uint8 shift)
static void
test_node_types(uint8 shift)
{
- radix_tree *radixtree;
+ rt_radix_tree *radixtree;
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
@@ -312,7 +319,7 @@ test_node_types(uint8 shift)
static void
test_pattern(const test_spec * spec)
{
- radix_tree *radixtree;
+ rt_radix_tree *radixtree;
rt_iter *iter;
MemoryContext radixtree_ctx;
TimestampTz starttime;
--
2.39.0
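For anyone skimming the series, here is a minimal usage sketch of the rt_* interface documented in the file header of the next patch (the memory context setup, keys, and values here are illustrative assumptions, not part of the patches):

    /* instantiate a tree in its own memory context, probe it, and tear it down */
    MemoryContext ctx = AllocSetContextCreate(CurrentMemoryContext,
                                              "radix tree example",
                                              ALLOCSET_DEFAULT_SIZES);
    radix_tree *tree = rt_create(ctx);
    uint64      value;

    rt_set(tree, UINT64CONST(42), UINT64CONST(4242));   /* returns false: key was new */

    if (rt_search(tree, UINT64CONST(42), &value))        /* sets value = 4242 */
        elog(NOTICE, "found " UINT64_FORMAT, value);

    rt_delete(tree, UINT64CONST(42));
    rt_free(tree);
    MemoryContextDelete(ctx);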
Attachment: v17-0006-Convert-radixtree.c-into-a-header.patch (text/x-patch; charset=US-ASCII)
From 45cad7dcb2c14e035ffd03ca59fcedaf51674bb4 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 4 Jan 2023 12:54:51 +0700
Subject: [PATCH v17 6/9] Convert radixtree.c into a header
Preparation for converting to a template.
---
src/backend/lib/Makefile | 1 -
src/backend/lib/meson.build | 1 -
src/backend/lib/radixtree.c | 1767 -----------------------------------
src/include/lib/radixtree.h | 1762 +++++++++++++++++++++++++++++++++-
4 files changed, 1753 insertions(+), 1778 deletions(-)
delete mode 100644 src/backend/lib/radixtree.c
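As a rough sketch of where this is heading (mirroring the test_radixtree.c hunk in the previous patch; the exact set of RT_* switches is only settled by the later template patches), a user of the header is expected to instantiate it along these lines:

  /* generate a local radix tree implementation named with the "rt_" prefix */
  #define RT_PREFIX rt
  #define RT_SCOPE static
  #define RT_DECLARE
  #define RT_DEFINE
  #include "lib/radixtree.h"

  /* ...which provides rt_radix_tree, rt_create(), rt_set(), rt_search(), etc. */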
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 4c1db794b6..9dad31398a 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,7 +22,6 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
- radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 5f8df32c5c..974cab8776 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -11,5 +11,4 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
- 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
deleted file mode 100644
index 80cde09aaf..0000000000
--- a/src/backend/lib/radixtree.c
+++ /dev/null
@@ -1,1767 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * radixtree.c
- * Implementation for adaptive radix tree.
- *
- * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
- * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
- * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
- * types, each with a different number of elements. Depending on the number of
- * children, the appropriate node type is used.
- *
- * There are some differences from the proposed implementation. For instance,
- * there is no support for path compression or lazy path expansion. The radix
- * tree supports a fixed key length, so we don't expect the tree to become
- * very high.
- *
- * Both the key and the value are 64-bit unsigned integers. The inner nodes and
- * the leaf nodes have slightly different structures: inner nodes, with
- * shift > 0, store pointers to their child nodes as values, while leaf nodes,
- * with shift == 0, store the 64-bit unsigned integer specified by the user as
- * the value. The paper refers to this technique as "Multi-value leaves". We
- * choose it to avoid an additional pointer traversal, which is also the reason
- * this code currently does not support variable-length keys.
- *
- * XXX: Most functions in this file have two variants for inner nodes and leaf
- * nodes, so there is duplicated code. While this sometimes makes code
- * maintenance tricky, it reduces branch prediction misses when judging
- * whether a node is an inner node or a leaf node.
- *
- * XXX: radix tree nodes are never shrunk.
- *
- * Interface
- * ---------
- *
- * rt_create - Create a new, empty radix tree
- * rt_free - Free the radix tree
- * rt_search - Search a key-value pair
- * rt_set - Set a key-value pair
- * rt_delete - Delete a key-value pair
- * rt_begin_iterate - Begin iterating through all key-value pairs
- * rt_iterate_next - Return next key-value pair, if any
- * rt_end_iter - End iteration
- * rt_memory_usage - Get the memory usage
- * rt_num_entries - Get the number of key-value pairs
- *
- * rt_create() creates an empty radix tree in the given memory context
- * and memory contexts for all kinds of radix tree node under the memory context.
- *
- * rt_iterate_next() ensures returning key-value pairs in the ascending
- * order of the key.
- *
- * Copyright (c) 2022, PostgreSQL Global Development Group
- *
- * IDENTIFICATION
- * src/backend/lib/radixtree.c
- *
- *-------------------------------------------------------------------------
- */
-
-#include "postgres.h"
-
-#include "lib/radixtree.h"
-#include "lib/stringinfo.h"
-#include "miscadmin.h"
-#include "nodes/bitmapset.h"
-#include "port/pg_bitutils.h"
-#include "port/pg_lfind.h"
-#include "utils/memutils.h"
-
-#ifdef RT_DEBUG
-#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
-#endif
-
-/* The number of bits encoded in one tree level */
-#define RT_NODE_SPAN BITS_PER_BYTE
-
-/* The number of maximum slots in the node */
-#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
-
-/* Mask for extracting a chunk from the key */
-#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
-
-/* Maximum shift the radix tree uses */
-#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
-
-/* Tree level the radix tree uses */
-#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
-
-/* Invalid index used in node-125 */
-#define RT_NODE_125_INVALID_IDX 0xFF
-
-/* Get a chunk from the key */
-#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
-
-/* For accessing bitmaps */
-#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
-#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
-
-/*
- * Supported radix tree node kinds and size classes.
- *
- * There are 4 node kinds and each node kind has one or two size classes,
- * partial and full. The size classes within the same node kind have the same
- * node structure but a different fanout, which is stored
- * in 'fanout' of rt_node. For example in size class 15, when a 16th element
- * is to be inserted, we allocate a larger area and memcpy the entire old
- * node to it.
- *
- * This technique allows us to limit the node kinds to 4, which limits the
- * number of cases in switch statements. It also allows a possible future
- * optimization to encode the node kind in a pointer tag.
- *
- * These size classes have been chosen carefully so that they minimize the
- * allocator padding in both the inner and leaf nodes on DSA.
- *
- */
-#define RT_NODE_KIND_4 0x00
-#define RT_NODE_KIND_32 0x01
-#define RT_NODE_KIND_125 0x02
-#define RT_NODE_KIND_256 0x03
-#define RT_NODE_KIND_COUNT 4
-
-typedef enum rt_size_class
-{
- RT_CLASS_4_FULL = 0,
- RT_CLASS_32_PARTIAL,
- RT_CLASS_32_FULL,
- RT_CLASS_125_FULL,
- RT_CLASS_256
-
-#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
-} rt_size_class;
-
-/* Common type for all nodes types */
-typedef struct rt_node
-{
- /*
- * Number of children. We use uint16 to be able to indicate 256 children
- * at the fanout of 8.
- */
- uint16 count;
-
- /* Max number of children. We can use uint8 because we never need to store 256 */
- /* WIP: if we don't have a variable sized node4, this should instead be in the base
- types as needed, since saving every byte is crucial for the smallest node kind */
- uint8 fanout;
-
- /*
- * Shift indicates which part of the key space is represented by this
- * node. That is, the key is shifted by 'shift' and the lowest
- * RT_NODE_SPAN bits are then represented in chunk.
- */
- uint8 shift;
- uint8 chunk;
-
- /* Node kind, one per search/set algorithm */
- uint8 kind;
-} rt_node;
-#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
-#define VAR_NODE_HAS_FREE_SLOT(node) \
- ((node)->base.n.count < (node)->base.n.fanout)
-#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
- ((node)->base.n.count < rt_size_class_info[class].fanout)
-
-/* Base type of each node kinds for leaf and inner nodes */
-/* The base types must be able to accommodate the largest size
-class for variable-sized node kinds */
-typedef struct rt_node_base_4
-{
- rt_node n;
-
- /* 4 children, for key chunks */
- uint8 chunks[4];
-} rt_node_base_4;
-
-typedef struct rt_node_base32
-{
- rt_node n;
-
- /* 32 children, for key chunks */
- uint8 chunks[32];
-} rt_node_base_32;
-
-/*
- * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
- * 256, to store indexes into a second array that contains up to 125 values (or
- * child pointers in inner nodes).
- */
-typedef struct rt_node_base125
-{
- rt_node n;
-
- /* The index of slots for each fanout */
- uint8 slot_idxs[RT_NODE_MAX_SLOTS];
-
- /* isset is a bitmap to track which slot is in use */
- bitmapword isset[BM_IDX(128)];
-} rt_node_base_125;
-
-typedef struct rt_node_base256
-{
- rt_node n;
-} rt_node_base_256;
-
-/*
- * Inner and leaf nodes.
- *
- * These are separate for two main reasons:
- *
- * 1) the value type might be different than something fitting into a pointer
- * width type
- * 2) Need to represent non-existing values in a key-type independent way.
- *
- * 1) is clearly worth being concerned about, but it's not clear 2) is as
- * good. It might be better to just indicate non-existing entries the same way
- * in inner nodes.
- */
-typedef struct rt_node_inner_4
-{
- rt_node_base_4 base;
-
- /* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_inner_4;
-
-typedef struct rt_node_leaf_4
-{
- rt_node_base_4 base;
-
- /* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_leaf_4;
-
-typedef struct rt_node_inner_32
-{
- rt_node_base_32 base;
-
- /* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_inner_32;
-
-typedef struct rt_node_leaf_32
-{
- rt_node_base_32 base;
-
- /* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_leaf_32;
-
-typedef struct rt_node_inner_125
-{
- rt_node_base_125 base;
-
- /* number of children depends on size class */
- rt_node *children[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_inner_125;
-
-typedef struct rt_node_leaf_125
-{
- rt_node_base_125 base;
-
- /* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
-} rt_node_leaf_125;
-
-/*
- * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
- * for directly storing values (or child pointers in inner nodes).
- */
-typedef struct rt_node_inner_256
-{
- rt_node_base_256 base;
-
- /* Slots for 256 children */
- rt_node *children[RT_NODE_MAX_SLOTS];
-} rt_node_inner_256;
-
-typedef struct rt_node_leaf_256
-{
- rt_node_base_256 base;
-
- /* isset is a bitmap to track which slot is in use */
- bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
-
- /* Slots for 256 values */
- uint64 values[RT_NODE_MAX_SLOTS];
-} rt_node_leaf_256;
-
-/* Information for each size class */
-typedef struct rt_size_class_elem
-{
- const char *name;
- int fanout;
-
- /* slab chunk size */
- Size inner_size;
- Size leaf_size;
-
- /* slab block size */
- Size inner_blocksize;
- Size leaf_blocksize;
-} rt_size_class_elem;
-
-/*
- * Calculate the slab blocksize so that we can allocate at least 32 chunks
- * from the block.
- */
-#define NODE_SLAB_BLOCK_SIZE(size) \
- Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
-static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
- [RT_CLASS_4_FULL] = {
- .name = "radix tree node 4",
- .fanout = 4,
- .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
- },
- [RT_CLASS_32_PARTIAL] = {
- .name = "radix tree node 15",
- .fanout = 15,
- .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
- },
- [RT_CLASS_32_FULL] = {
- .name = "radix tree node 32",
- .fanout = 32,
- .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
- },
- [RT_CLASS_125_FULL] = {
- .name = "radix tree node 125",
- .fanout = 125,
- .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
- .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
- },
- [RT_CLASS_256] = {
- .name = "radix tree node 256",
- .fanout = 256,
- .inner_size = sizeof(rt_node_inner_256),
- .leaf_size = sizeof(rt_node_leaf_256),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
- },
-};
-
-/* Map from the node kind to its minimum size class */
-static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
- [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
- [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
- [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
- [RT_NODE_KIND_256] = RT_CLASS_256,
-};
-
-/*
- * Iteration support.
- *
- * Iterating the radix tree returns each pair of key and value in the ascending
- * order of the key. To support this, we iterate over the nodes of each level.
- *
- * rt_node_iter struct is used to track the iteration within a node.
- *
- * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
- * in order to track the iteration of each level. During the iteration, we also
- * construct the key whenever updating the node iteration information, e.g., when
- * advancing the current index within the node or when moving to the next node
- * at the same level.
- */
-typedef struct rt_node_iter
-{
- rt_node *node; /* current node being iterated */
- int current_idx; /* current position. -1 for initial value */
-} rt_node_iter;
-
-struct rt_iter
-{
- radix_tree *tree;
-
- /* Track the iteration on nodes of each level */
- rt_node_iter stack[RT_MAX_LEVEL];
- int stack_len;
-
- /* The key is being constructed during the iteration */
- uint64 key;
-};
-
-/* A radix tree with nodes */
-struct radix_tree
-{
- MemoryContext context;
-
- rt_node *root;
- uint64 max_val;
- uint64 num_keys;
-
- MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
- MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
-
- /* statistics */
-#ifdef RT_DEBUG
- int32 cnt[RT_SIZE_CLASS_COUNT];
-#endif
-};
-
-static void rt_new_root(radix_tree *tree, uint64 key);
-static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
-static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
- bool inner);
-static void rt_free_node(radix_tree *tree, rt_node *node);
-static void rt_extend(radix_tree *tree, uint64 key);
-static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p);
-static inline bool rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p);
-static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, rt_node *child);
-static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, uint64 value);
-static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
-static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p);
-static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
-static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
-
-/* verification (available only with assertion) */
-static void rt_verify_node(rt_node *node);
-
-/*
- * Return index of the first element in 'base' that equals 'key'. Return -1
- * if there is no such element.
- */
-static inline int
-node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
-{
- int idx = -1;
-
- for (int i = 0; i < node->n.count; i++)
- {
- if (node->chunks[i] == chunk)
- {
- idx = i;
- break;
- }
- }
-
- return idx;
-}
-
-/*
- * Return index of the chunk to insert into chunks in the given node.
- */
-static inline int
-node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
-{
- int idx;
-
- for (idx = 0; idx < node->n.count; idx++)
- {
- if (node->chunks[idx] >= chunk)
- break;
- }
-
- return idx;
-}
-
-/*
- * Return index of the first element in 'base' that equals 'key'. Return -1
- * if there is no such element.
- */
-static inline int
-node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
-{
- int count = node->n.count;
-#ifndef USE_NO_SIMD
- Vector8 spread_chunk;
- Vector8 haystack1;
- Vector8 haystack2;
- Vector8 cmp1;
- Vector8 cmp2;
- uint32 bitfield;
- int index_simd = -1;
-#endif
-
-#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
- int index = -1;
-
- for (int i = 0; i < count; i++)
- {
- if (node->chunks[i] == chunk)
- {
- index = i;
- break;
- }
- }
-#endif
-
-#ifndef USE_NO_SIMD
- spread_chunk = vector8_broadcast(chunk);
- vector8_load(&haystack1, &node->chunks[0]);
- vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
- cmp1 = vector8_eq(spread_chunk, haystack1);
- cmp2 = vector8_eq(spread_chunk, haystack2);
- bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
- bitfield &= ((UINT64CONST(1) << count) - 1);
-
- if (bitfield)
- index_simd = pg_rightmost_one_pos32(bitfield);
-
- Assert(index_simd == index);
- return index_simd;
-#else
- return index;
-#endif
-}
-
-/*
- * Return index of the chunk to insert into chunks in the given node.
- */
-static inline int
-node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
-{
- int count = node->n.count;
-#ifndef USE_NO_SIMD
- Vector8 spread_chunk;
- Vector8 haystack1;
- Vector8 haystack2;
- Vector8 cmp1;
- Vector8 cmp2;
- Vector8 min1;
- Vector8 min2;
- uint32 bitfield;
- int index_simd;
-#endif
-
-#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
- int index;
-
- for (index = 0; index < count; index++)
- {
- if (node->chunks[index] >= chunk)
- break;
- }
-#endif
-
-#ifndef USE_NO_SIMD
- spread_chunk = vector8_broadcast(chunk);
- vector8_load(&haystack1, &node->chunks[0]);
- vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
- min1 = vector8_min(spread_chunk, haystack1);
- min2 = vector8_min(spread_chunk, haystack2);
- cmp1 = vector8_eq(spread_chunk, min1);
- cmp2 = vector8_eq(spread_chunk, min2);
- bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
- bitfield &= ((UINT64CONST(1) << count) - 1);
-
- if (bitfield)
- index_simd = pg_rightmost_one_pos32(bitfield);
- else
- index_simd = count;
-
- Assert(index_simd == index);
- return index_simd;
-#else
- return index;
-#endif
-}
-
-/*
- * Functions to manipulate both chunks array and children/values array.
- * These are used for node-4 and node-32.
- */
-
-/* Shift the elements right at 'idx' by one */
-static inline void
-chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
-{
- memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
-}
-
-static inline void
-chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
-{
- memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64 *) * (count - idx));
-}
-
-/* Delete the element at 'idx' */
-static inline void
-chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
-{
- memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
-}
-
-static inline void
-chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
-{
- memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
-}
-
-/* Copy both chunks and children/values arrays */
-static inline void
-chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
- uint8 *dst_chunks, rt_node **dst_children)
-{
- const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
- const Size chunk_size = sizeof(uint8) * fanout;
- const Size children_size = sizeof(rt_node *) * fanout;
-
- memcpy(dst_chunks, src_chunks, chunk_size);
- memcpy(dst_children, src_children, children_size);
-}
-
-static inline void
-chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
- uint8 *dst_chunks, uint64 *dst_values)
-{
- const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
- const Size chunk_size = sizeof(uint8) * fanout;
- const Size values_size = sizeof(uint64) * fanout;
-
- memcpy(dst_chunks, src_chunks, chunk_size);
- memcpy(dst_values, src_values, values_size);
-}
-
-/* Functions to manipulate inner and leaf node-125 */
-
-/* Does the given chunk in the node have a value? */
-static inline bool
-node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
-{
- return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
-}
-
-static inline rt_node *
-node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
-{
- Assert(!NODE_IS_LEAF(node));
- return node->children[node->base.slot_idxs[chunk]];
-}
-
-static inline uint64
-node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
-{
- Assert(NODE_IS_LEAF(node));
- Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
- return node->values[node->base.slot_idxs[chunk]];
-}
-
-/* Functions to manipulate inner and leaf node-256 */
-
-/* Return true if the slot corresponding to the given chunk is in use */
-static inline bool
-node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
-{
- Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
-}
-
-static inline bool
-node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
-{
- int idx = BM_IDX(chunk);
- int bitnum = BM_BIT(chunk);
-
- Assert(NODE_IS_LEAF(node));
- return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
-}
-
-static inline rt_node *
-node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
-{
- Assert(!NODE_IS_LEAF(node));
- Assert(node_inner_256_is_chunk_used(node, chunk));
- return node->children[chunk];
-}
-
-static inline uint64
-node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
-{
- Assert(NODE_IS_LEAF(node));
- Assert(node_leaf_256_is_chunk_used(node, chunk));
- return node->values[chunk];
-}
-
-/* Set the child in the node-256 */
-static inline void
-node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
-{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = child;
-}
-
-/* Set the value in the node-256 */
-static inline void
-node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
-{
- int idx = BM_IDX(chunk);
- int bitnum = BM_BIT(chunk);
-
- Assert(NODE_IS_LEAF(node));
- node->isset[idx] |= ((bitmapword) 1 << bitnum);
- node->values[chunk] = value;
-}
-
-/* Clear the slot at the given chunk position */
-static inline void
-node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
-{
- Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
-}
-
-static inline void
-node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
-{
- int idx = BM_IDX(chunk);
- int bitnum = BM_BIT(chunk);
-
- Assert(NODE_IS_LEAF(node));
- node->isset[idx] &= ~((bitmapword) 1 << bitnum);
-}
-
-/*
- * Return the shift needed to store the given key.
- */
-static inline int
-key_get_shift(uint64 key)
-{
- return (key == 0)
- ? 0
- : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
-}
-
-/*
- * Return the max value stored in a node with the given shift.
- */
-static uint64
-shift_get_max_val(int shift)
-{
- if (shift == RT_MAX_SHIFT)
- return UINT64_MAX;
-
- return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
-}
-
-/*
- * Create a new node as the root. Subordinate nodes will be created during
- * the insertion.
- */
-static void
-rt_new_root(radix_tree *tree, uint64 key)
-{
- int shift = key_get_shift(key);
- bool inner = shift > 0;
- rt_node *newnode;
-
- newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
- rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newnode->shift = shift;
- tree->max_val = shift_get_max_val(shift);
- tree->root = newnode;
-}
-
-/*
- * Allocate a new node with the given node kind.
- */
-static rt_node *
-rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
-{
- rt_node *newnode;
-
- if (inner)
- newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
- rt_size_class_info[size_class].inner_size);
- else
- newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
- rt_size_class_info[size_class].leaf_size);
-
-#ifdef RT_DEBUG
- /* update the statistics */
- tree->cnt[size_class]++;
-#endif
-
- return newnode;
-}
-
-/* Initialize the node contents */
-static inline void
-rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
-{
- if (inner)
- MemSet(node, 0, rt_size_class_info[size_class].inner_size);
- else
- MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
-
- node->kind = kind;
- node->fanout = rt_size_class_info[size_class].fanout;
-
- /* Initialize slot_idxs to invalid values */
- if (kind == RT_NODE_KIND_125)
- {
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
-
- memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
- }
-
- /*
- * Technically it's 256, but we cannot store that in a uint8,
- * and this is the max size class, so it will never grow.
- */
- if (kind == RT_NODE_KIND_256)
- node->fanout = 0;
-}
-
-static inline void
-rt_copy_node(rt_node *newnode, rt_node *oldnode)
-{
- newnode->shift = oldnode->shift;
- newnode->chunk = oldnode->chunk;
- newnode->count = oldnode->count;
-}
-
-/*
- * Create a new node with 'new_kind' and the same shift, chunk, and
- * count of 'node'.
- */
-static rt_node*
-rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
-{
- rt_node *newnode;
- bool inner = !NODE_IS_LEAF(node);
-
- newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
- rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
- rt_copy_node(newnode, node);
-
- return newnode;
-}
-
-/* Free the given node */
-static void
-rt_free_node(radix_tree *tree, rt_node *node)
-{
- /* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
- {
- tree->root = NULL;
- tree->max_val = 0;
- }
-
-#ifdef RT_DEBUG
- {
- int i;
-
- /* update the statistics */
- for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- if (node->fanout == rt_size_class_info[i].fanout)
- break;
- }
-
- /* fanout of node256 is intentionally 0 */
- if (i == RT_SIZE_CLASS_COUNT)
- i = RT_CLASS_256;
-
- tree->cnt[i]--;
- Assert(tree->cnt[i] >= 0);
- }
-#endif
-
- pfree(node);
-}
-
-/*
- * Replace old_child with new_child, and free the old one.
- */
-static void
-rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
- rt_node *new_child, uint64 key)
-{
- Assert(old_child->chunk == new_child->chunk);
- Assert(old_child->shift == new_child->shift);
-
- if (parent == old_child)
- {
- /* Replace the root node with the new large node */
- tree->root = new_child;
- }
- else
- {
- bool replaced PG_USED_FOR_ASSERTS_ONLY;
-
- replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
- Assert(replaced);
- }
-
- rt_free_node(tree, old_child);
-}
-
-/*
- * The radix tree doesn't have sufficient height. Extend the radix tree so it can
- * store the key.
- */
-static void
-rt_extend(radix_tree *tree, uint64 key)
-{
- int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
-
- target_shift = key_get_shift(key);
-
- /* Grow tree from 'shift' to 'target_shift' */
- while (shift <= target_shift)
- {
- rt_node_inner_4 *node;
-
- node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
- rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
- node->base.n.shift = shift;
- node->base.n.count = 1;
- node->base.chunks[0] = 0;
- node->children[0] = tree->root;
-
- tree->root->chunk = 0;
- tree->root = (rt_node *) node;
-
- shift += RT_NODE_SPAN;
- }
-
- tree->max_val = shift_get_max_val(target_shift);
-}
-
-/*
- * The radix tree doesn't have inner and leaf nodes for given key-value pair.
- * Insert inner and leaf nodes from 'node' to bottom.
- */
-static inline void
-rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
- rt_node *node)
-{
- int shift = node->shift;
-
- while (shift >= RT_NODE_SPAN)
- {
- rt_node *newchild;
- int newshift = shift - RT_NODE_SPAN;
- bool inner = newshift > 0;
-
- newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
- rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
- newchild->shift = newshift;
- newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
- rt_node_insert_inner(tree, parent, node, key, newchild);
-
- parent = node;
- node = newchild;
- shift -= RT_NODE_SPAN;
- }
-
- rt_node_insert_leaf(tree, parent, node, key, value);
- tree->num_keys++;
-}
-
-/*
- * Search for the child pointer corresponding to 'key' in the given node.
- *
- * Return true if the key is found, otherwise return false. On success, the child
- * pointer is set to child_p.
- */
-static inline bool
-rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p)
-{
-#define RT_NODE_LEVEL_INNER
-#include "lib/radixtree_search_impl.h"
-#undef RT_NODE_LEVEL_INNER
-}
-
-/*
- * Search for the value corresponding to 'key' in the given node.
- *
- * Return true if the key is found, otherwise return false. On success, the pointer
- * to the value is set to value_p.
- */
-static inline bool
-rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p)
-{
-#define RT_NODE_LEVEL_LEAF
-#include "lib/radixtree_search_impl.h"
-#undef RT_NODE_LEVEL_LEAF
-}
-
-/*
- * Search for the child pointer corresponding to 'key' in the given node.
- *
- * Delete the entry and return true if the key is found, otherwise return false.
- */
-static inline bool
-rt_node_delete_inner(rt_node *node, uint64 key)
-{
-#define RT_NODE_LEVEL_INNER
-#include "lib/radixtree_delete_impl.h"
-#undef RT_NODE_LEVEL_INNER
-}
-
-/*
- * Search for the value corresponding to 'key' in the given node.
- *
- * Delete the entry and return true if the key is found, otherwise return false.
- */
-static inline bool
-rt_node_delete_leaf(rt_node *node, uint64 key)
-{
-#define RT_NODE_LEVEL_LEAF
-#include "lib/radixtree_delete_impl.h"
-#undef RT_NODE_LEVEL_LEAF
-}
-
-/* Insert the child to the inner node */
-static bool
-rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
- rt_node *child)
-{
-#define RT_NODE_LEVEL_INNER
-#include "lib/radixtree_insert_impl.h"
-#undef RT_NODE_LEVEL_INNER
-}
-
-/* Insert the value to the leaf node */
-static bool
-rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
- uint64 key, uint64 value)
-{
-#define RT_NODE_LEVEL_LEAF
-#include "lib/radixtree_insert_impl.h"
-#undef RT_NODE_LEVEL_LEAF
-}
-
-/*
- * Create the radix tree in the given memory context and return it.
- */
-radix_tree *
-rt_create(MemoryContext ctx)
-{
- radix_tree *tree;
- MemoryContext old_ctx;
-
- old_ctx = MemoryContextSwitchTo(ctx);
-
- tree = palloc(sizeof(radix_tree));
- tree->context = ctx;
- tree->root = NULL;
- tree->max_val = 0;
- tree->num_keys = 0;
-
- /* Create the slab allocator for each size class */
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- tree->inner_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].inner_size);
- tree->leaf_slabs[i] = SlabContextCreate(ctx,
- rt_size_class_info[i].name,
- rt_size_class_info[i].leaf_blocksize,
- rt_size_class_info[i].leaf_size);
-#ifdef RT_DEBUG
- tree->cnt[i] = 0;
-#endif
- }
-
- MemoryContextSwitchTo(old_ctx);
-
- return tree;
-}
-
-/*
- * Free the given radix tree.
- */
-void
-rt_free(radix_tree *tree)
-{
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- MemoryContextDelete(tree->inner_slabs[i]);
- MemoryContextDelete(tree->leaf_slabs[i]);
- }
-
- pfree(tree);
-}
-
-/*
- * Set key to value. If the entry already exists, we update its value to 'value'
- * and return true. Returns false if entry doesn't yet exist.
- */
-bool
-rt_set(radix_tree *tree, uint64 key, uint64 value)
-{
- int shift;
- bool updated;
- rt_node *node;
- rt_node *parent;
-
- /* Empty tree, create the root */
- if (!tree->root)
- rt_new_root(tree, key);
-
- /* Extend the tree if necessary */
- if (key > tree->max_val)
- rt_extend(tree, key);
-
- Assert(tree->root);
-
- shift = tree->root->shift;
- node = parent = tree->root;
-
- /* Descend the tree until a leaf node */
- while (shift >= 0)
- {
- rt_node *child;
-
- if (NODE_IS_LEAF(node))
- break;
-
- if (!rt_node_search_inner(node, key, &child))
- {
- rt_set_extend(tree, key, value, parent, node);
- return false;
- }
-
- parent = node;
- node = child;
- shift -= RT_NODE_SPAN;
- }
-
- updated = rt_node_insert_leaf(tree, parent, node, key, value);
-
- /* Update the statistics */
- if (!updated)
- tree->num_keys++;
-
- return updated;
-}
-
-/*
- * Search the given key in the radix tree. Return true if the key exists,
- * otherwise return false. On success, the value is set to *value_p, which
- * therefore must not be NULL.
- */
-bool
-rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
-{
- rt_node *node;
- int shift;
-
- Assert(value_p != NULL);
-
- if (!tree->root || key > tree->max_val)
- return false;
-
- node = tree->root;
- shift = tree->root->shift;
-
- /* Descend the tree until a leaf node */
- while (shift >= 0)
- {
- rt_node *child;
-
- if (NODE_IS_LEAF(node))
- break;
-
- if (!rt_node_search_inner(node, key, &child))
- return false;
-
- node = child;
- shift -= RT_NODE_SPAN;
- }
-
- return rt_node_search_leaf(node, key, value_p);
-}
-
-/*
- * Delete the given key from the radix tree. Return true if the key is found (and
- * deleted), otherwise do nothing and return false.
- */
-bool
-rt_delete(radix_tree *tree, uint64 key)
-{
- rt_node *node;
- rt_node *stack[RT_MAX_LEVEL] = {0};
- int shift;
- int level;
- bool deleted;
-
- if (!tree->root || key > tree->max_val)
- return false;
-
- /*
- * Descend the tree to search the key while building a stack of nodes we
- * visited.
- */
- node = tree->root;
- shift = tree->root->shift;
- level = -1;
- while (shift > 0)
- {
- rt_node *child;
-
- /* Push the current node to the stack */
- stack[++level] = node;
-
- if (!rt_node_search_inner(node, key, &child))
- return false;
-
- node = child;
- shift -= RT_NODE_SPAN;
- }
-
- /* Delete the key from the leaf node if exists */
- Assert(NODE_IS_LEAF(node));
- deleted = rt_node_delete_leaf(node, key);
-
- if (!deleted)
- {
- /* no key is found in the leaf node */
- return false;
- }
-
- /* Found the key to delete. Update the statistics */
- tree->num_keys--;
-
- /*
- * Return if the leaf node still has keys and we don't need to delete the
- * node.
- */
- if (!NODE_IS_EMPTY(node))
- return true;
-
- /* Free the empty leaf node */
- rt_free_node(tree, node);
-
- /* Delete the key in inner nodes recursively */
- while (level >= 0)
- {
- node = stack[level--];
-
- deleted = rt_node_delete_inner(node, key);
- Assert(deleted);
-
- /* If the node didn't become empty, we stop deleting the key */
- if (!NODE_IS_EMPTY(node))
- break;
-
- /* The node became empty */
- rt_free_node(tree, node);
- }
-
- return true;
-}
-
-/* Create and return the iterator for the given radix tree */
-rt_iter *
-rt_begin_iterate(radix_tree *tree)
-{
- MemoryContext old_ctx;
- rt_iter *iter;
- int top_level;
-
- old_ctx = MemoryContextSwitchTo(tree->context);
-
- iter = (rt_iter *) palloc0(sizeof(rt_iter));
- iter->tree = tree;
-
- /* empty tree */
- if (!iter->tree->root)
- return iter;
-
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
- iter->stack_len = top_level;
-
- /*
- * Descend to the leftmost leaf node from the root. The key is being
- * constructed while descending to the leaf.
- */
- rt_update_iter_stack(iter, iter->tree->root, top_level);
-
- MemoryContextSwitchTo(old_ctx);
-
- return iter;
-}
-
-/*
- * Update each node_iter for inner nodes in the iterator node stack.
- */
-static void
-rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
-{
- int level = from;
- rt_node *node = from_node;
-
- for (;;)
- {
- rt_node_iter *node_iter = &(iter->stack[level--]);
-
- node_iter->node = node;
- node_iter->current_idx = -1;
-
- /* We don't advance the leaf node iterator here */
- if (NODE_IS_LEAF(node))
- return;
-
- /* Advance to the next slot in the inner node */
- node = rt_node_inner_iterate_next(iter, node_iter);
-
- /* We must find the first child in the node */
- Assert(node);
- }
-}
-
-/*
- * Return true and set key_p and value_p if there is a next key. Otherwise,
- * return false.
- */
-bool
-rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
-{
- /* Empty tree */
- if (!iter->tree->root)
- return false;
-
- for (;;)
- {
- rt_node *child = NULL;
- uint64 value;
- int level;
- bool found;
-
- /* Advance the leaf node iterator to get next key-value pair */
- found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
-
- if (found)
- {
- *key_p = iter->key;
- *value_p = value;
- return true;
- }
-
- /*
- * We've visited all values in the leaf node, so advance inner node
- * iterators from the level=1 until we find the next child node.
- */
- for (level = 1; level <= iter->stack_len; level++)
- {
- child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
-
- if (child)
- break;
- }
-
- /* the iteration finished */
- if (!child)
- return false;
-
- /*
- * Set the node to the node iterator and update the iterator stack
- * from this node.
- */
- rt_update_iter_stack(iter, child, level - 1);
-
- /* Node iterators are updated, so try again from the leaf */
- }
-
- return false;
-}
-
-void
-rt_end_iterate(rt_iter *iter)
-{
- pfree(iter);
-}
-
-static inline void
-rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
-{
- iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
- iter->key |= (((uint64) chunk) << shift);
-}
-
-/*
- * Advance the slot in the inner node. Return the child if exists, otherwise
- * null.
- */
-static inline rt_node *
-rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
-{
-#define RT_NODE_LEVEL_INNER
-#include "lib/radixtree_iter_impl.h"
-#undef RT_NODE_LEVEL_INNER
-}
-
-/*
- * Advance the slot in the leaf node. On success, return true and the value
- * is set to value_p, otherwise return false.
- */
-static inline bool
-rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
- uint64 *value_p)
-{
-#define RT_NODE_LEVEL_LEAF
-#include "lib/radixtree_iter_impl.h"
-#undef RT_NODE_LEVEL_LEAF
-}
-
-/*
- * Return the number of keys in the radix tree.
- */
-uint64
-rt_num_entries(radix_tree *tree)
-{
- return tree->num_keys;
-}
-
-/*
- * Return the statistics of the amount of memory used by the radix tree.
- */
-uint64
-rt_memory_usage(radix_tree *tree)
-{
- Size total = sizeof(radix_tree);
-
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
- }
-
- return total;
-}
-
-/*
- * Verify the radix tree node.
- */
-static void
-rt_verify_node(rt_node *node)
-{
-#ifdef USE_ASSERT_CHECKING
- Assert(node->count >= 0);
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- rt_node_base_4 *n4 = (rt_node_base_4 *) node;
-
- for (int i = 1; i < n4->n.count; i++)
- Assert(n4->chunks[i - 1] < n4->chunks[i]);
-
- break;
- }
- case RT_NODE_KIND_32:
- {
- rt_node_base_32 *n32 = (rt_node_base_32 *) node;
-
- for (int i = 1; i < n32->n.count; i++)
- Assert(n32->chunks[i - 1] < n32->chunks[i]);
-
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_base_125 *n125 = (rt_node_base_125 *) node;
- int cnt = 0;
-
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- uint8 slot = n125->slot_idxs[i];
- int idx = BM_IDX(slot);
- int bitnum = BM_BIT(slot);
-
- if (!node_125_is_chunk_used(n125, i))
- continue;
-
- /* Check if the corresponding slot is used */
- Assert(slot < node->fanout);
- Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
-
- cnt++;
- }
-
- Assert(n125->n.count == cnt);
- break;
- }
- case RT_NODE_KIND_256:
- {
- if (NODE_IS_LEAF(node))
- {
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
- int cnt = 0;
-
- for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
- cnt += bmw_popcount(n256->isset[i]);
-
- /* Check if the number of used chunk matches */
- Assert(n256->base.n.count == cnt);
-
- break;
- }
- }
- }
-#endif
-}
-
-/***************** DEBUG FUNCTIONS *****************/
-#ifdef RT_DEBUG
-void
-rt_stats(radix_tree *tree)
-{
- ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
-}
-
-static void
-rt_dump_node(rt_node *node, int level, bool recurse)
-{
- char space[125] = {0};
-
- fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
- NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_125) ? 125 : 256,
- node->fanout == 0 ? 256 : node->fanout,
- node->count, node->shift, node->chunk);
-
- if (level > 0)
- sprintf(space, "%*c", level * 4, ' ');
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- for (int i = 0; i < node->count; i++)
- {
- if (NODE_IS_LEAF(node))
- {
- rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
-
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n4->base.chunks[i], n4->values[i]);
- }
- else
- {
- rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
-
- fprintf(stderr, "%schunk 0x%X ->",
- space, n4->base.chunks[i]);
-
- if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
- else
- fprintf(stderr, "\n");
- }
- }
- break;
- }
- case RT_NODE_KIND_32:
- {
- for (int i = 0; i < node->count; i++)
- {
- if (NODE_IS_LEAF(node))
- {
- rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
-
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n32->base.chunks[i], n32->values[i]);
- }
- else
- {
- rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
-
- fprintf(stderr, "%schunk 0x%X ->",
- space, n32->base.chunks[i]);
-
- if (recurse)
- {
- rt_dump_node(n32->children[i], level + 1, recurse);
- }
- else
- fprintf(stderr, "\n");
- }
- }
- break;
- }
- case RT_NODE_KIND_125:
- {
- rt_node_base_125 *b125 = (rt_node_base_125 *) node;
-
- fprintf(stderr, "slot_idxs ");
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (!node_125_is_chunk_used(b125, i))
- continue;
-
- fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
- }
- if (NODE_IS_LEAF(node))
- {
- rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
-
- fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < BM_IDX(128); i++)
- {
- fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
- }
- fprintf(stderr, "\n");
- }
-
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (!node_125_is_chunk_used(b125, i))
- continue;
-
- if (NODE_IS_LEAF(node))
- {
- rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
-
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, node_leaf_125_get_value(n125, i));
- }
- else
- {
- rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
-
- fprintf(stderr, "%schunk 0x%X ->",
- space, i);
-
- if (recurse)
- rt_dump_node(node_inner_125_get_child(n125, i),
- level + 1, recurse);
- else
- fprintf(stderr, "\n");
- }
- }
- break;
- }
- case RT_NODE_KIND_256:
- {
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (NODE_IS_LEAF(node))
- {
- rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
-
- if (!node_leaf_256_is_chunk_used(n256, i))
- continue;
-
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, node_leaf_256_get_value(n256, i));
- }
- else
- {
- rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
-
- if (!node_inner_256_is_chunk_used(n256, i))
- continue;
-
- fprintf(stderr, "%schunk 0x%X ->",
- space, i);
-
- if (recurse)
- rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
- recurse);
- else
- fprintf(stderr, "\n");
- }
- }
- break;
- }
- }
-}
-
-void
-rt_dump_search(radix_tree *tree, uint64 key)
-{
- rt_node *node;
- int shift;
- int level = 0;
-
- elog(NOTICE, "-----------------------------------------------------------");
- elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
- tree->max_val, tree->max_val);
-
- if (!tree->root)
- {
- elog(NOTICE, "tree is empty");
- return;
- }
-
- if (key > tree->max_val)
- {
- elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
- key, key);
- return;
- }
-
- node = tree->root;
- shift = tree->root->shift;
- while (shift >= 0)
- {
- rt_node *child;
-
- rt_dump_node(node, level, false);
-
- if (NODE_IS_LEAF(node))
- {
- uint64 dummy;
-
- /* We reached at a leaf node, find the corresponding slot */
- rt_node_search_leaf(node, key, &dummy);
-
- break;
- }
-
- if (!rt_node_search_inner(node, key, &child))
- break;
-
- node = child;
- shift -= RT_NODE_SPAN;
- level++;
- }
-}
-
-void
-rt_dump(radix_tree *tree)
-{
-
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
- rt_size_class_info[i].name,
- rt_size_class_info[i].inner_size,
- rt_size_class_info[i].inner_blocksize,
- rt_size_class_info[i].leaf_size,
- rt_size_class_info[i].leaf_blocksize);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
-
- if (!tree->root)
- {
- fprintf(stderr, "empty tree\n");
- return;
- }
-
- rt_dump_node(tree->root, 0, true);
-}
-#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d5d7668617..fe517793f4 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1,24 +1,412 @@
/*-------------------------------------------------------------------------
*
- * radixtree.h
- * Interface for radix tree.
+ * radixtree.h
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports a fixed key length, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes, with
+ * shift > 0, store pointers to their child nodes as values, while leaf nodes,
+ * with shift == 0, store the 64-bit unsigned integer specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal, which is also the reason
+ * this code currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is duplicated code. While this sometimes makes code
+ * maintenance tricky, it reduces branch prediction misses when judging
+ * whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context
+ * and memory contexts for all kinds of radix tree node under the memory context.
+ *
+ * rt_iterate_next() ensures returning key-value pairs in the ascending
+ * order of the key.
*
* Copyright (c) 2022, PostgreSQL Global Development Group
*
* IDENTIFICATION
- * src/include/lib/radixtree.h
+ * src/backend/lib/radixtree.c
*
*-------------------------------------------------------------------------
*/
-#ifndef RADIXTREE_H
-#define RADIXTREE_H
#include "postgres.h"
-#define RT_DEBUG 1
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind have the same
+ * node structure but a different fanout, which is stored
+ * in 'fanout' of rt_node. For example in size class 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ *
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
+/* Common type for all nodes types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < rt_size_class_info[class].fanout)
+
+/* Base type of each node kinds for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+class for variable-sized node kinds */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct rt_node_base125
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(128)];
+} rt_node_base_125;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_125
+{
+ rt_node_base_125 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_125;
+
+typedef struct rt_node_leaf_125
+{
+ rt_node_base_125 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_125;
+
+/*
+ * node-256 is the largest node type. This node has an array of RT_NODE_MAX_SLOTS
+ * entries for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size class */
+typedef struct rt_size_class_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_size_class_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
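
(Editorial note, not part of the patch: to make the NODE_SLAB_BLOCK_SIZE arithmetic
above concrete, here is a worked example for a hypothetical 48-byte node chunk,
assuming the stock 8kB SLAB_DEFAULT_BLOCK_SIZE; the actual struct sizes depend on
padding and pointer width.)

    /* hypothetical 48-byte chunk, assuming SLAB_DEFAULT_BLOCK_SIZE = 8192 */
    NODE_SLAB_BLOCK_SIZE(48)
        = Max((8192 / 48) * 48, 48 * 32)
        = Max(8160, 1536)
        = 8160
    /* i.e. the block size is rounded down to a multiple of the chunk size,
       but is never smaller than 32 chunks */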
+
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+/* A radix tree with nodes */
+typedef struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} radix_tree;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+typedef struct rt_iter
+{
+ radix_tree *tree;
-typedef struct radix_tree radix_tree;
-typedef struct rt_iter rt_iter;
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+} rt_iter;
extern radix_tree *rt_create(MemoryContext ctx);
extern void rt_free(radix_tree *tree);
@@ -39,4 +427,1360 @@ extern void rt_dump_search(radix_tree *tree, uint64 key);
extern void rt_stats(radix_tree *tree);
#endif
-#endif /* RADIXTREE_H */
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the node's chunk array.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+	memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(rt_node *) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+static inline rt_node *
+node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Clear the child at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key value that can be stored in a tree whose root has the given shift.
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
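
(Editorial illustration, not part of the patch: this is how a key is decomposed into
per-level chunks using the two helpers above. The chunk extraction mirrors what
RT_GET_KEY_CHUNK does elsewhere in the patch, assuming RT_NODE_SPAN is 8 bits per
level as in this patch series.)

    /* illustration only: decompose a key into 8-bit chunks, one per level */
    uint64  key = UINT64CONST(0x0102030405060708);
    int     shift = key_get_shift(key);    /* = 56 for this key */

    while (shift >= 0)
    {
        uint8   chunk = (uint8) ((key >> shift) & 0xFF);

        /* chunk selects the slot at this level: 0x01, 0x02, ..., 0x08 */
        shift -= RT_NODE_SPAN;             /* 8 bits per level */
    }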
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ bool inner = shift > 0;
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = newnode;
+}
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ else
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+ * Technically it's 256, but we cannot store that in a uint8,
+	 * and this is the largest size class, so it will never grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+static inline void
+rt_copy_node(rt_node *newnode, rt_node *oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static rt_node*
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+{
+ rt_node *newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
+ rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
+ rt_copy_node(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it
+ * can store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->base.n.shift = shift;
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
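
(Editorial worked example, not part of the patch: suppose the root currently has
shift 0 and key 0x10000 is inserted.)

    key          = 0x10000;                  /* highest set bit is bit 16 */
    target_shift = key_get_shift(key);       /* = 16 */
    /* loop iterations at shift = 8 and shift = 16: two new node-4 inner nodes
       are pushed on top of the old root */
    max_val      = shift_get_max_val(16);    /* = 0xFFFFFF */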
+
+/*
+ * The radix tree doesn't yet have the inner and leaf nodes for the given
+ * key-value pair. Insert inner nodes, and finally a leaf, from 'node' down to
+ * the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is set in *child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_node **child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is set in *value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Delete the child entry corresponding to 'key' from the given inner node.
+ *
+ * Return true if the key was found (and the entry deleted), otherwise return false.
+ */
+static inline bool
+rt_node_delete_inner(rt_node *node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Delete the value corresponding to 'key' from the given leaf node.
+ *
+ * Return true if the key was found (and the value deleted), otherwise return false.
+ */
+static inline bool
+rt_node_delete_leaf(rt_node *node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set 'key' to 'value'. If the entry already exists, update its value and
+ * return true; otherwise insert it and return false.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
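
(Editorial usage sketch, not part of the patch: a minimal example of the public API
implemented in this file, using rt_create/rt_set/rt_search/rt_delete/rt_free;
CurrentMemoryContext is just a convenient context to pass, and the key/value are
arbitrary.)

    /* minimal usage sketch, illustration only */
    radix_tree *tree = rt_create(CurrentMemoryContext);
    uint64      value;

    rt_set(tree, 42, 8675309);          /* returns false: key was not present */
    if (rt_search(tree, 42, &value))
        Assert(value == 8675309);

    rt_delete(tree, 42);
    rt_free(tree);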
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set in *value_p, so it must
+ * not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_delete_leaf(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+	/* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_delete_inner(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is
+	 * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner node
+		 * iterators, starting at level 1, until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
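
(Editorial usage sketch, not part of the patch: minimal iteration over an existing
radix_tree 'tree' using the functions above.)

    /* illustration only: visit all key-value pairs in ascending key order */
    rt_iter    *iter = rt_begin_iterate(tree);
    uint64      key;
    uint64      value;

    while (rt_iterate_next(iter, &key, &value))
    {
        /* process (key, value) */
    }
    rt_end_iterate(iter);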
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if it exists,
+ * otherwise NULL.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value in *value_p; otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!node_125_is_chunk_used(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check if the number of used chunk matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(128); i++)
+ {
+ fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_125_get_value(n125, i));
+ }
+ else
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_125_get_child(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node; find the corresponding slot */
+ rt_node_search_leaf(node, key, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
--
2.39.0
Attachment: v17-0001-introduce-vector8_min-and-vector8_highbit_mask.patch (text/x-patch)
From b5653e1d6ac004f5b5420d240f9c0ee142495874 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v17 1/9] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..84d41a340a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.0
Attachment: v17-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch (text/x-patch)
From dd3ab2bf57b5fae0dd7c10a4a44d23db38d65140 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v17 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 50d86cb01b..e19fd2966d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3655,7 +3655,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.0
Attachment: v17-0004-tool-for-measuring-radix-tree-performance.patch (text/x-patch)
From f510d0d88460cbebebba9c089e38e02e054c71bb Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v17 4/9] tool for measuring radix tree performance
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 +++
contrib/bench_radix_tree/bench_radix_tree.c | 635 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 767 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..a0693695e6
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,635 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
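
(Editorial worked example, not part of the patch: with MaxHeapTuplesPerPage = 291,
pg_ceil_log2_32(291) = 9, so a TID of (block 1000, offset 5) is encoded as follows.)

    /* worked example, illustration only */
    tid_i = 5 | ((uint64) 1000 << 9);   /* = 512005 */
    key   = tid_i >> 6;                 /* = 8000: the radix tree key */
    off   = tid_i & 63;                 /* = 5: bit position within the 64-bit value */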
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+	/* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+	/* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+		/* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F << 32) | (0x07 << 24) | (0xFF << 16) | 0xFF);
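+ /*
+ * Each byte of the filter caps how many distinct chunk values can appear
+ * at the corresponding tree level, so masking the hashed keys with it
+ * tends to populate a mix of node kinds across the levels.
+ */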
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t < 10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.39.0
v17-0003-Add-radix-implementation.patch
From dfe269fb71621a6ec580ef3d8ae601bdbc0c4b91 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v17 3/9] Add radix implementation.
---
src/backend/lib/Makefile | 1 +
src/backend/lib/meson.build | 1 +
src/backend/lib/radixtree.c | 2514 +++++++++++++++++
src/include/lib/radixtree.h | 42 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 581 ++++
.../test_radixtree/test_radixtree.control | 4 +
15 files changed, 3264 insertions(+)
create mode 100644 src/backend/lib/radixtree.c
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 9dad31398a..4c1db794b6 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -22,6 +22,7 @@ OBJS = \
integerset.o \
knapsack.o \
pairingheap.o \
+ radixtree.o \
rbtree.o \
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/meson.build b/src/backend/lib/meson.build
index 974cab8776..5f8df32c5c 100644
--- a/src/backend/lib/meson.build
+++ b/src/backend/lib/meson.build
@@ -11,4 +11,5 @@ backend_sources += files(
'knapsack.c',
'pairingheap.c',
'rbtree.c',
+ 'radixtree.c',
)
diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
new file mode 100644
index 0000000000..5203127f76
--- /dev/null
+++ b/src/backend/lib/radixtree.c
@@ -0,0 +1,2514 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes: a small number of
+ * node types, each with a different number of elements. Depending on the
+ * number of children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes, with
+ * shift > 0, store pointers to their child nodes as values, whereas leaf nodes,
+ * with shift == 0, store the 64-bit unsigned integer specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal, and it is the reason this
+ * code currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is a fair amount of duplicated code. While this sometimes
+ * makes code maintenance tricky, it reduces branch prediction misses when
+ * judging whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * Interface
+ * ---------
+ *
+ * rt_create - Create a new, empty radix tree
+ * rt_free - Free the radix tree
+ * rt_search - Search a key-value pair
+ * rt_set - Set a key-value pair
+ * rt_delete - Delete a key-value pair
+ * rt_begin_iterate - Begin iterating through all key-value pairs
+ * rt_iterate_next - Return next key-value pair, if any
+ * rt_end_iter - End iteration
+ * rt_memory_usage - Get the memory usage
+ * rt_num_entries - Get the number of key-value pairs
+ *
+ * rt_create() creates an empty radix tree in the given memory context, along
+ * with child memory contexts under it for each kind of radix tree node.
+ *
+ * rt_iterate_next() returns key-value pairs in ascending key order.
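+ *
+ * A minimal usage sketch (illustrative only; the key and value are arbitrary):
+ *
+ *   radix_tree *rt = rt_create(CurrentMemoryContext);
+ *   uint64 val;
+ *
+ *   rt_set(rt, 42, 4242);
+ *   if (rt_search(rt, 42, &val))
+ *     Assert(val == 4242);
+ *   rt_free(rt);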
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/backend/lib/radixtree.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/radixtree.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT key_get_shift(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
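+/* e.g., with RT_NODE_SPAN = 8, RT_GET_KEY_CHUNK(0x10203, 8) == 0x02 */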
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
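+/* e.g., with 64-bit bitmapwords, bit 130 maps to BM_IDX(130) == 2 and BM_BIT(130) == 2 */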
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/* Enum used by rt_node_search() */
+typedef enum
+{
+ RT_ACTION_FIND = 0, /* find the key-value */
+ RT_ACTION_DELETE, /* delete the key-value */
+} rt_action;
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds, and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind have the same
+ * node structure but a different fanout, which is stored in 'fanout' of
+ * rt_node. For example in size class 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+typedef enum rt_size_class
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+
+#define RT_SIZE_CLASS_COUNT (RT_CLASS_256 + 1)
+} rt_size_class;
+
+/* Common type for all node types */
+typedef struct rt_node
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} rt_node;
+#define NODE_IS_LEAF(n) (((rt_node *) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((rt_node *) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < rt_size_class_info[class].fanout)
+
+/* Base types of each node kind for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+class for variable-sized node kinds */
+typedef struct rt_node_base_4
+{
+ rt_node n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} rt_node_base_4;
+
+typedef struct rt_node_base32
+{
+ rt_node n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} rt_node_base_32;
+
+/*
+ * node-125 uses the slot_idxs array, an array of length RT_NODE_MAX_SLOTS
+ * (typically 256), to store indexes into a second array that contains up to
+ * 125 values (or child pointers in inner nodes).
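+ *
+ * For example (hypothetical values): if slot_idxs[0x41] holds 7, the value or
+ * child pointer for chunk 0x41 is stored at index 7 of the values/children
+ * array, and bit 7 of isset marks that slot as used.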
+ */
+typedef struct rt_node_base125
+{
+ rt_node n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(128)];
+} rt_node_base_125;
+
+typedef struct rt_node_base256
+{
+ rt_node n;
+} rt_node_base_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct rt_node_inner_4
+{
+ rt_node_base_4 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_4;
+
+typedef struct rt_node_leaf_4
+{
+ rt_node_base_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_4;
+
+typedef struct rt_node_inner_32
+{
+ rt_node_base_32 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_32;
+
+typedef struct rt_node_leaf_32
+{
+ rt_node_base_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_32;
+
+typedef struct rt_node_inner_125
+{
+ rt_node_base_125 base;
+
+ /* number of children depends on size class */
+ rt_node *children[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_inner_125;
+
+typedef struct rt_node_leaf_125
+{
+ rt_node_base_125 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} rt_node_leaf_125;
+
+/*
+ * node-256 is the largest node type. This node has an array of RT_NODE_MAX_SLOTS entries
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct rt_node_inner_256
+{
+ rt_node_base_256 base;
+
+ /* Slots for 256 children */
+ rt_node *children[RT_NODE_MAX_SLOTS];
+} rt_node_inner_256;
+
+typedef struct rt_node_leaf_256
+{
+ rt_node_base_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} rt_node_leaf_256;
+
+/* Information for each size class */
+typedef struct rt_size_class_elem
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} rt_size_class_elem;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
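+ * For instance, assuming SLAB_DEFAULT_BLOCK_SIZE is 8kB and a chunk size of 48
+ * bytes, this yields Max((8192 / 48) * 48, 48 * 32) = Max(8160, 1536) = 8160.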
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+static rt_size_class_elem rt_size_class_info[RT_SIZE_CLASS_COUNT] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_4) + 4 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 15 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_32) + 32 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *),
+ .leaf_size = sizeof(rt_node_leaf_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_125) + 125 * sizeof(rt_node *)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(rt_node_inner_256),
+ .leaf_size = sizeof(rt_node_leaf_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_inner_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(rt_node_leaf_256)),
+ },
+};
+
+/* Map from the node kind to its minimum size class */
+static rt_size_class kind_min_size_class[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over nodes at each level.
+ *
+ * rt_node_iter struct is used to track the iteration within a node.
+ *
+ * rt_iter is the struct for iteration of the radix tree, and uses rt_node_iter
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
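+ *
+ * For illustration, with RT_NODE_SPAN = 8 and a tree whose root has shift 16,
+ * stack[2] tracks the root, stack[1] the inner node at shift 8, and stack[0]
+ * the current leaf; the key is rebuilt one chunk at a time as each level's
+ * current_idx advances.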
+ */
+typedef struct rt_node_iter
+{
+ rt_node *node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} rt_node_iter;
+
+struct rt_iter
+{
+ radix_tree *tree;
+
+ /* Track the iteration on nodes of each level */
+ rt_node_iter stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+};
+
+/* A radix tree with nodes */
+struct radix_tree
+{
+ MemoryContext context;
+
+ rt_node *root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+};
+
+static void rt_new_root(radix_tree *tree, uint64 key);
+static rt_node *rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner);
+static inline void rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class,
+ bool inner);
+static void rt_free_node(radix_tree *tree, rt_node *node);
+static void rt_extend(radix_tree *tree, uint64 key);
+static inline bool rt_node_search_inner(rt_node *node, uint64 key, rt_action action,
+ rt_node **child_p);
+static inline bool rt_node_search_leaf(rt_node *node, uint64 key, rt_action action,
+ uint64 *value_p);
+static bool rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, rt_node *child);
+static bool rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value);
+static inline rt_node *rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter);
+static inline bool rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p);
+static void rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from);
+static inline void rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift);
+
+/* verification (available only with assertion) */
+static void rt_verify_node(rt_node *node);
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_4_search_eq(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+node_4_get_insertpos(rt_node_base_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+node_32_search_eq(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+node_32_get_insertpos(rt_node_base_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+chunk_children_array_shift(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(rt_node *) * (count - idx));
+}
+
+static inline void
+chunk_values_array_shift(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+chunk_children_array_delete(uint8 *chunks, rt_node **children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(rt_node *) * (count - idx - 1));
+}
+
+static inline void
+chunk_values_array_delete(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(rt_node *) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+chunk_values_array_copy(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = rt_size_class_info[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+node_125_is_chunk_used(rt_node_base_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+static inline rt_node *
+node_inner_125_get_child(rt_node_inner_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+node_leaf_125_get_value(rt_node_leaf_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((rt_node_base_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+static void
+node_inner_125_delete(rt_node_inner_125 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+ int idx = BM_IDX(slotpos);
+ int bitnum = BM_BIT(slotpos);
+
+ Assert(!NODE_IS_LEAF(node));
+
+ node->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ node->children[node->base.slot_idxs[chunk]] = NULL;
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+static void
+node_leaf_125_delete(rt_node_leaf_125 *node, uint8 chunk)
+{
+ int slotpos = node->base.slot_idxs[chunk];
+ int idx = BM_IDX(slotpos);
+ int bitnum = BM_BIT(slotpos);
+
+ Assert(NODE_IS_LEAF(node));
+ node->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ node->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+}
+
+/* Return an unused slot in node-125 */
+static int
+node_125_find_unused_slot(bitmapword *isset)
+{
+ int slotpos;
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(128); idx++)
+ {
+ if (isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
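+ /* e.g., if isset[idx] is 0b0111 (slots 0-2 used), ~X has its first set bit at position 3 */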
+ inverse = ~(isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+
+ /* mark the slot used */
+ isset[idx] |= bmw_rightmost_one(inverse);
+
+ return slotpos;
+ }
+
+static inline void
+node_inner_125_insert(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ int slotpos;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ slotpos = node_125_find_unused_slot(node->base.isset);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->children[slotpos] = child;
+}
+
+/* Set the slot at the corresponding chunk */
+static inline void
+node_leaf_125_insert(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ int slotpos;
+
+ Assert(NODE_IS_LEAF(node));
+
+ slotpos = node_125_find_unused_slot(node->base.isset);
+ Assert(slotpos < node->base.n.fanout);
+
+ node->base.slot_idxs[chunk] = slotpos;
+ node->values[slotpos] = value;
+}
+
+/* Update the child corresponding to 'chunk' to 'child' */
+static inline void
+node_inner_125_update(rt_node_inner_125 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[node->base.slot_idxs[chunk]] = child;
+}
+
+static inline void
+node_leaf_125_update(rt_node_leaf_125 *node, uint8 chunk, uint64 value)
+{
+ Assert(NODE_IS_LEAF(node));
+ node->values[node->base.slot_idxs[chunk]] = value;
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+node_inner_256_is_chunk_used(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+node_leaf_256_is_chunk_used(rt_node_leaf_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline rt_node *
+node_inner_256_get_child(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(node_inner_256_is_chunk_used(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+node_leaf_256_get_value(rt_node_leaf_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(node_leaf_256_is_chunk_used(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+node_inner_256_set(rt_node_inner_256 *node, uint8 chunk, rt_node *child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+node_leaf_256_set(rt_node_leaf_256 *node, uint8 chunk, uint64 value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the slot at the given chunk position */
+static inline void
+node_inner_256_delete(rt_node_inner_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+node_leaf_256_delete(rt_node_leaf_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the shift needed to store the given key.
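+ * For example, a key of 0x10000 has its leftmost set bit at position 16, so
+ * with RT_NODE_SPAN = 8 this returns (16 / 8) * 8 = 16; keys below 256 get 0.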
+ */
+static inline int
+key_get_shift(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
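+ * For example, a root node with shift 8 can store keys up to (1 << 16) - 1 = 0xFFFF.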
+ */
+static uint64
+shift_get_max_val(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+rt_new_root(radix_tree *tree, uint64 key)
+{
+ int shift = key_get_shift(key);
+ bool inner = shift > 0;
+ rt_node *newnode;
+
+ newnode = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = shift_get_max_val(shift);
+ tree->root = newnode;
+}
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static rt_node *
+rt_alloc_node(radix_tree *tree, rt_size_class size_class, bool inner)
+{
+ rt_node *newnode;
+
+ if (inner)
+ newnode = (rt_node *) MemoryContextAlloc(tree->inner_slabs[size_class],
+ rt_size_class_info[size_class].inner_size);
+ else
+ newnode = (rt_node *) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ rt_size_class_info[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+rt_init_node(rt_node *node, uint8 kind, rt_size_class size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, rt_size_class_info[size_class].inner_size);
+ else
+ MemSet(node, 0, rt_size_class_info[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = rt_size_class_info[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+ * Technically it's 256, but we cannot store that in a uint8,
+ * and this is the max size class so it will never grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+static inline void
+rt_copy_node(rt_node *newnode, rt_node *oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count as 'node'.
+ */
+static rt_node*
+rt_grow_node_kind(radix_tree *tree, rt_node *node, uint8 new_kind)
+{
+ rt_node *newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = rt_alloc_node(tree, kind_min_size_class[new_kind], inner);
+ rt_init_node(newnode, new_kind, kind_min_size_class[new_kind], inner);
+ rt_copy_node(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+rt_free_node(radix_tree *tree, rt_node *node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == rt_size_class_info[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+rt_replace_node(radix_tree *tree, rt_node *parent, rt_node *old_child,
+ rt_node *new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = rt_node_insert_inner(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ rt_free_node(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+rt_extend(radix_tree *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = key_get_shift(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ rt_node_inner_4 *node;
+
+ node = (rt_node_inner_4 *) rt_alloc_node(tree, RT_CLASS_4_FULL, true);
+ rt_init_node((rt_node *) node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->base.n.shift = shift;
+ node->base.n.count = 1;
+ node->base.chunks[0] = 0;
+ node->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = (rt_node *) node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = shift_get_max_val(target_shift);
+}
+
+/*
+ * The radix tree doesn't have the inner and leaf nodes for the given key-value
+ * pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+rt_set_extend(radix_tree *tree, uint64 key, uint64 value, rt_node *parent,
+ rt_node *node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ rt_node *newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ newchild = rt_alloc_node(tree, RT_CLASS_4_FULL, inner);
+ rt_init_node(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ rt_node_insert_inner(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ rt_node_insert_leaf(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * do the specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to *child_p.
+ */
+static inline bool
+rt_node_search_inner(rt_node *node, uint64 key, rt_action action, rt_node **child_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ rt_node *child = NULL;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = n4->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = n32->children[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_children_array_delete(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ child = node_inner_125_get_child(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ child = node_inner_256_get_child(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_inner_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && child_p)
+ *child_p = child;
+
+ return found;
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and do the
+ * specified 'action'.
+ *
+ * Return true if the key is found, otherwise return false. On success, the value
+ * is copied to *value_p.
+ */
+static inline bool
+rt_node_search_leaf(rt_node *node, uint64 key, rt_action action, uint64 *value_p)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool found = false;
+ uint64 value = 0;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = n4->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+
+ if (idx < 0)
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = n32->values[idx];
+ else /* RT_ACTION_DELETE */
+ chunk_values_array_delete(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ break;
+
+ found = true;
+
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_125_get_value(n125, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_125_delete(n125, chunk);
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, chunk))
+ break;
+
+ found = true;
+ if (action == RT_ACTION_FIND)
+ value = node_leaf_256_get_value(n256, chunk);
+ else /* RT_ACTION_DELETE */
+ node_leaf_256_delete(n256, chunk);
+
+ break;
+ }
+ }
+
+ /* update statistics */
+ if (action == RT_ACTION_DELETE && found)
+ node->count--;
+
+ if (found && value_p)
+ *value_p = value;
+
+ return found;
+}
+
+/* Insert the child to the inner node */
+static bool
+rt_node_insert_inner(radix_tree *tree, rt_node *parent, rt_node *node, uint64 key,
+ rt_node *child)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ rt_node *newnode = NULL;
+
+ Assert(!NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_inner_32 *new32;
+
+ /* grow node from 4 to 32 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) newnode;
+ chunk_children_array_copy(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos(&n4->base, chunk);
+ uint16 count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (count != 0 && insertpos < count)
+ chunk_children_array_shift(n4->base.chunks, n4->children,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const rt_size_class_elem minclass = rt_size_class_info[RT_CLASS_32_PARTIAL];
+ const rt_size_class_elem maxclass = rt_size_class_info[RT_CLASS_32_FULL];
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->children[idx] = child;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.count == minclass.fanout)
+ {
+ /* grow to the next size class of this kind */
+ newnode = rt_alloc_node(tree, RT_CLASS_32_FULL, true);
+ memcpy(newnode, node, minclass.inner_size);
+ newnode->fanout = maxclass.fanout;
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (rt_node_inner_32 *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ rt_node_inner_125 *new125;
+
+ /* grow node from 32 to 125 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_125);
+ new125 = (rt_node_inner_125 *) newnode;
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_inner_125_insert(new125, n32->base.chunks[i], n32->children[i]);
+
+ Assert(parent != NULL);
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = node_32_get_insertpos(&n32->base, chunk);
+ int16 count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+ chunk_children_array_shift(n32->base.chunks, n32->children,
+ count, insertpos);
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->children[insertpos] = child;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used(&n125->base, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_inner_125_update(n125, chunk, child);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_inner_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ newnode = rt_grow_node_kind(tree, node, RT_NODE_KIND_256);
+ new256 = (rt_node_inner_256 *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used(&n125->base, i))
+ continue;
+
+ node_inner_256_set(new256, i, node_inner_125_get_child(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ node_inner_125_insert(n125, chunk, child);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ chunk_exists = node_inner_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_inner_256_set(n256, chunk, child);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/* Insert the value to the leaf node */
+static bool
+rt_node_insert_leaf(radix_tree *tree, rt_node *parent, rt_node *node,
+ uint64 key, uint64 value)
+{
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+ Assert(NODE_IS_LEAF(node));
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+ int idx;
+
+ idx = node_4_search_eq((rt_node_base_4 *) n4, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n4->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ rt_node_leaf_32 *new32;
+ Assert(parent != NULL);
+
+ /* grow node from 4 to 32 */
+ new32 = (rt_node_leaf_32 *) rt_grow_node_kind(tree, (rt_node *) n4,
+ RT_NODE_KIND_32);
+ chunk_values_array_copy(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+ rt_replace_node(tree, parent, (rt_node *) n4, (rt_node *) new32, key);
+ node = (rt_node *) new32;
+ }
+ else
+ {
+ int insertpos = node_4_get_insertpos((rt_node_base_4 *) n4, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and values */
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n4->base.chunks, n4->values,
+ count, insertpos);
+
+ n4->base.chunks[insertpos] = chunk;
+ n4->values[insertpos] = value;
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+ int idx;
+
+ idx = node_32_search_eq((rt_node_base_32 *) n32, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = value;
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ Assert(parent != NULL);
+
+ if (n32->base.n.count == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
+ {
+ /* use the same node kind, but expand to the next size class */
+ const Size size = rt_size_class_info[RT_CLASS_32_PARTIAL].leaf_size;
+ const int fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
+ rt_node_leaf_32 *new32;
+
+ new32 = (rt_node_leaf_32 *) rt_alloc_node(tree, RT_CLASS_32_FULL, false);
+ memcpy(new32, n32, size);
+ new32->base.n.fanout = fanout;
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new32, key);
+
+ /* must update both pointers here */
+ node = (rt_node *) new32;
+ n32 = new32;
+
+ goto retry_insert_leaf_32;
+ }
+ else
+ {
+ rt_node_leaf_125 *new125;
+
+ /* grow node from 32 to 125 */
+ new125 = (rt_node_leaf_125 *) rt_grow_node_kind(tree, (rt_node *) n32,
+ RT_NODE_KIND_125);
+ for (int i = 0; i < n32->base.n.count; i++)
+ node_leaf_125_insert(new125, n32->base.chunks[i], n32->values[i]);
+
+ rt_replace_node(tree, parent, (rt_node *) n32, (rt_node *) new125,
+ key);
+ node = (rt_node *) new125;
+ }
+ }
+ else
+ {
+ retry_insert_leaf_32:
+ {
+ int insertpos = node_32_get_insertpos((rt_node_base_32 *) n32, chunk);
+ int count = n32->base.n.count;
+
+ if (count != 0 && insertpos < count)
+ chunk_values_array_shift(n32->base.chunks, n32->values,
+ count, insertpos);
+
+ n32->base.chunks[insertpos] = chunk;
+ n32->values[insertpos] = value;
+ break;
+ }
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node;
+ int cnt = 0;
+
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, chunk))
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ node_leaf_125_update(n125, chunk, value);
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ rt_node_leaf_256 *new256;
+ Assert(parent != NULL);
+
+ /* grow node from 125 to 256 */
+ new256 = (rt_node_leaf_256 *) rt_grow_node_kind(tree, (rt_node *) n125,
+ RT_NODE_KIND_256);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ continue;
+
+ node_leaf_256_set(new256, i, node_leaf_125_get_value(n125, i));
+ cnt++;
+ }
+
+ rt_replace_node(tree, parent, (rt_node *) n125, (rt_node *) new256,
+ key);
+ node = (rt_node *) new256;
+ }
+ else
+ {
+ node_leaf_125_insert(n125, chunk, value);
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ chunk_exists = node_leaf_256_is_chunk_used(n256, chunk);
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+ node_leaf_256_set(n256, chunk, value);
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ rt_verify_node(node);
+
+ return chunk_exists;
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+radix_tree *
+rt_create(MemoryContext ctx)
+{
+ radix_tree *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(radix_tree));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].leaf_blocksize,
+ rt_size_class_info[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+void
+rt_free(radix_tree *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true. Return false if the entry doesn't yet exist.
+ */
+bool
+rt_set(radix_tree *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ rt_node *node;
+ rt_node *parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ rt_new_root(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ rt_extend(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ {
+ rt_set_extend(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = rt_node_insert_leaf(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set to *value_p, which
+ * therefore must not be NULL.
+ */
+bool
+rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
+{
+ rt_node *node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+bool
+rt_delete(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ rt_node *stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ rt_node *child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ rt_free_node(tree, node);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ rt_free_node(tree, node);
+ }
+
+ return true;
+}
+
+/* Create and return the iterator for the given radix tree */
+rt_iter *
+rt_begin_iterate(radix_tree *tree)
+{
+ MemoryContext old_ctx;
+ rt_iter *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (rt_iter *) palloc0(sizeof(rt_iter));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->root)
+ return iter;
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is being
+ * constructed while descending to the leaf.
+ */
+ rt_update_iter_stack(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+rt_update_iter_stack(rt_iter *iter, rt_node *from_node, int from)
+{
+ int level = from;
+ rt_node *node = from_node;
+
+ for (;;)
+ {
+ rt_node_iter *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = rt_node_inner_iterate_next(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise,
+ * return false.
+ */
+bool
+rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ rt_node *child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = rt_node_leaf_iterate_next(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from the level=1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = rt_node_inner_iterate_next(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ rt_update_iter_stack(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+void
+rt_end_iterate(rt_iter *iter)
+{
+ pfree(iter);
+}
+
+static inline void
+rt_iter_update_key(rt_iter *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if exists, otherwise
+ * null.
+ */
+static inline rt_node *
+rt_node_inner_iterate_next(rt_iter *iter, rt_node_iter *node_iter)
+{
+ rt_node *child = NULL;
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ child = n4->children[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ child = n32->children[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_125_get_child(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_inner_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ child = node_inner_256_get_child(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+
+ return child;
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to value_p, otherwise return false.
+ */
+static inline bool
+rt_node_leaf_iterate_next(rt_iter *iter, rt_node_iter *node_iter,
+ uint64 *value_p)
+{
+ rt_node *node = node_iter->node;
+ bool found = false;
+ uint64 value;
+ uint8 key_chunk;
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+
+ value = n4->values[node_iter->current_idx];
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+ value = n32->values[node_iter->current_idx];
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_125_is_chunk_used((rt_node_base_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_125_get_value(n125, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (node_leaf_256_is_chunk_used(n256, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+ value = node_leaf_256_get_value(n256, i);
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ rt_iter_update_key(iter, key_chunk, node_iter->node->shift);
+ *value_p = value;
+ }
+
+ return found;
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+uint64
+rt_num_entries(radix_tree *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the statistics of the amount of memory used by the radix tree.
+ */
+uint64
+rt_memory_usage(radix_tree *tree)
+{
+ Size total = sizeof(radix_tree);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+rt_verify_node(rt_node *node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ rt_node_base_4 *n4 = (rt_node_base_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ rt_node_base_32 *n32 = (rt_node_base_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *n125 = (rt_node_base_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int bitnum = BM_BIT(slot);
+
+ if (!node_125_is_chunk_used(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[i] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check if the number of used chunk matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(radix_tree *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n32_partial = %u, n32_full = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(rt_node *node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_4 *n4 = (rt_node_leaf_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ rt_node_inner_4 *n4 = (rt_node_inner_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_32 *n32 = (rt_node_leaf_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ rt_node_inner_32 *n32 = (rt_node_inner_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ rt_node_base_125 *b125 = (rt_node_base_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n = (rt_node_leaf_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(128); i++)
+ {
+ fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!node_125_is_chunk_used(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_125 *n125 = (rt_node_leaf_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_125_get_value(n125, i));
+ }
+ else
+ {
+ rt_node_inner_125 *n125 = (rt_node_inner_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_125_get_child(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ rt_node_leaf_256 *n256 = (rt_node_leaf_256 *) node;
+
+ if (!node_leaf_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, node_leaf_256_get_value(n256, i));
+ }
+ else
+ {
+ rt_node_inner_256 *n256 = (rt_node_inner_256 *) node;
+
+ if (!node_inner_256_is_chunk_used(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(node_inner_256_get_child(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(radix_tree *tree, uint64 key)
+{
+ rt_node *node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ rt_node *child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node, find the corresponding slot */
+ rt_node_search_leaf(node, key, RT_ACTION_FIND, &dummy);
+
+ break;
+ }
+
+ if (!rt_node_search_inner(node, key, RT_ACTION_FIND, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(radix_tree *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ rt_size_class_info[i].name,
+ rt_size_class_info[i].inner_size,
+ rt_size_class_info[i].inner_blocksize,
+ rt_size_class_info[i].leaf_size,
+ rt_size_class_info[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d5d7668617
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Interface for radix tree.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef RADIXTREE_H
+#define RADIXTREE_H
+
+#include "postgres.h"
+
+#define RT_DEBUG 1
+
+typedef struct radix_tree radix_tree;
+typedef struct rt_iter rt_iter;
+
+extern radix_tree *rt_create(MemoryContext ctx);
+extern void rt_free(radix_tree *tree);
+extern bool rt_search(radix_tree *tree, uint64 key, uint64 *val_p);
+extern bool rt_set(radix_tree *tree, uint64 key, uint64 val);
+extern rt_iter *rt_begin_iterate(radix_tree *tree);
+
+extern bool rt_iterate_next(rt_iter *iter, uint64 *key_p, uint64 *value_p);
+extern void rt_end_iterate(rt_iter *iter);
+extern bool rt_delete(radix_tree *tree, uint64 key);
+
+extern uint64 rt_memory_usage(radix_tree *tree);
+extern uint64 rt_num_entries(radix_tree *tree);
+
+#ifdef RT_DEBUG
+extern void rt_dump(radix_tree *tree);
+extern void rt_dump_search(radix_tree *tree, uint64 key);
+extern void rt_stats(radix_tree *tree);
+#endif
+
+#endif /* RADIXTREE_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..ea993e63df
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,581 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /* prepare keys in an order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
--
2.39.0
On Mon, Jan 9, 2023 at 5:59 PM John Naylor <john.naylor@enterprisedb.com> wrote:
[working on templating]
In the end, I decided to base my effort on v8, and not v12 (based on one of my less-well-thought-out ideas). The latter was a good experiment, but it did not lead to an increase in readability as I had hoped. The attached v17 is still rough, but it's in good enough shape to evaluate a mostly-complete templating implementation.
I really appreciate your work!
v13-0007 had some changes to the regression tests, but I haven't included those. The tests from v13-0003 do pass, both local and shared. I quickly hacked together switching the tests between shared and local by hand (which requires recompiling), but it would be good for maintainability if the tests could run once each with local and shared memory while using the same "expected" test output.
Agreed.
Also, I didn't look to see if there were any changes in v14/15 that didn't have to do with precise memory accounting.
At this point, Masahiko, I'd appreciate your feedback on whether this is an improvement at all (or at least a good base for improvement), especially for integrating with the TID store. I think there are some advantages to the template approach. One possible disadvantage is needing separate sets of functions for local and shared memory.
If we go this route, I do think the TID store should invoke the template as static functions. I'm not quite comfortable with a global function that may not fit well with future use cases.
It looks like there is no problem in terms of vacuum integration, although
I've not fully tested it yet. TID store uses the radix tree as the main
storage, and with the template radix tree, the data types for the shared and
non-shared cases will be different. TID store can have a union for the radix
tree, and the structure would look as follows:
/* Per-backend state for a TidStore */
struct TidStore
{
/*
* Control object. This is allocated in DSA area 'area' in the shared
* case, otherwise in backend-local memory.
*/
TidStoreControl *control;
/* Storage for Tids */
union tree
{
local_radix_tree *local;
shared_radix_tree *shared;
};
/* DSA area for TidStore if used */
dsa_area *area;
};
In the TID store functions, we need to call either the local or the shared
radix tree functions depending on whether the TID store is shared. We need an
if-branch for each key-value pair insertion (a rough sketch is below), but I
think it would not be a big performance problem in TID store use cases, since
vacuum is an I/O-intensive operation in many cases. Overall, I think there is
no problem, and I'll investigate it in depth.
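For illustration, a minimal sketch of that dispatch could look like the
following (the union member name "tree" and the local_rt_set()/shared_rt_set()
entry points are only assumptions for this sketch, not actual code; the shared
case is detected here via the DSA area pointer):

static bool
tidstore_set_key(TidStore *ts, uint64 key, uint64 value)
{
    /* branch once per key-value pair, depending on the memory kind */
    if (ts->area != NULL)
        return shared_rt_set(ts->tree.shared, key, value);
    else
        return local_rt_set(ts->tree.local, key, value);
}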
Apart from that, I've been considering lock support for the shared radix
tree. As we discussed before, the current usage (i.e., only parallel index
vacuum) doesn't require locking support at all, so it would be enough to have
a single lock for simplicity; a rough sketch of that is below. If we want to
use the shared radix tree for other use cases such as parallel heap vacuum or
replacing the hash table for shared buffers, we would need better lock
support. For example, if we want to support Optimistic Lock Coupling[1], we
would need to change not only the node structure but also the logic, which
would probably widen the gap between the code for the non-shared and shared
radix trees. In that case, once we have a better radix tree optimized for the
shared case, perhaps we can replace the templated shared radix tree with it.
I'd like to hear your opinion on this.
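As a minimal sketch of the single-lock idea (the embedded lock field and the
shared_rt_set() entry point are assumptions here, not actual code):

bool
shared_rt_set_locked(shared_radix_tree *tree, uint64 key, uint64 value)
{
    bool    found;

    /* one coarse-grained lock is enough while only parallel index vacuum uses it */
    LWLockAcquire(&tree->lock, LW_EXCLUSIVE);
    found = shared_rt_set(tree, key, value);
    LWLockRelease(&tree->lock);

    return found;
}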
One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding.
Oh, I didn't notice that. The chunk field was originally used when
redirecting the child pointer in the parent node from the old to the new
(grown) node. When redirecting the pointer, since the corresponding chunk
surely exists on the parent, we can skip the existence check. Currently we
use RT_NODE_UPDATE_INNER() for that (see RT_REPLACE_NODE()), but having a
dedicated function to update the existing chunk and child pointer might
improve performance. Or reducing the node size by getting rid of the chunk
field might be better.
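For instance, a dedicated update function for the smallest node could simply
scan for the chunk that is known to exist (just a sketch based on the node
layout in the patch; this helper itself is not part of the patch):

static inline void
rt_node_inner_4_update(rt_node_inner_4 *n4, uint8 chunk, rt_node *new_child)
{
    for (int i = 0; i < n4->base.n.count; i++)
    {
        if (n4->base.chunks[i] == chunk)
        {
            n4->children[i] = new_child;
            return;
        }
    }

    /* the chunk must exist on the parent when replacing a grown child */
    Assert(false);
}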
Also, v17-0005 has an optimization/simplification for growing into node125 (my version needs an assertion or fallback, but works well now), found by another reading of Andres' prototype. There is a lot of good engineering there; we should try to preserve it.
Agreed.
Regards,
[1]: https://db.in.tum.de/~leis/papers/artsync.pdf
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
It looks no problem in terms of vacuum integration, although I've not
fully tested yet. TID store uses the radix tree as the main storage,
and with the template radix tree, the data types for shared and
non-shared will be different. TID store can have an union for the
radix tree and the structure would be like follows:
/* Storage for Tids */
union tree
{
local_radix_tree *local;
shared_radix_tree *shared;
};
We could possibly go back to using a common data type for this, but with
unused fields in each setting, as before. We would have to be more careful
of things like the 32-bit crash from a few weeks ago.
In the functions of TID store, we need to call either local or shared
radix tree functions depending on whether TID store is shared or not.
We need if-branch for each key-value pair insertion, but I think it
would not be a big performance problem in TID store use cases, since
vacuum is an I/O intensive operation in many cases.
Also, the branch will be easily predicted. That was still true in earlier
patches, but with many more branches and fatter code paths.
Overall, I think
there is no problem and I'll investigate it in depth.
Okay, great. If the separate-functions approach turns out to be ugly, we
can always go back to the branching approach for shared memory. I think
we'll want to keep this as a template overall, at least to allow different
value types and to ease adding variable-length keys if someone finds a need.
Apart from that, I've been considering the lock support for shared
radix tree. As we discussed before, the current usage (i.e, only
parallel index vacuum) doesn't require locking support at all, so it
would be enough to have a single lock for simplicity.
Right, that should be enough for PG16.
If we want to
use the shared radix tree for other use cases such as the parallel
heap vacuum or the replacement of the hash table for shared buffers,
we would need better lock support.
For future parallel pruning, I still think a global lock is "probably" fine
if the workers buffer in local arrays. Highly concurrent applications will
need additional work, of course.
For example, if we want to support
Optimistic Lock Coupling[1],
Interesting, from the same authors!
we would need to change not only the node
structure but also the logic. Which probably leads to widen the gap
between the code for non-shared and shared radix tree. In this case,
once we have a better radix tree optimized for shared case, perhaps we
can replace the templated shared radix tree with it. I'd like to hear
your opinion on this line.
I'm not in a position to speculate on how best to do scalable concurrency,
much less how it should coexist with the local implementation. It's
interesting that their "ROWEX" scheme gives up maintaining order in the
linear nodes.
One review point I'll mention: Somehow I didn't notice there is no use
for the "chunk" field in the rt_node type -- it's only set to zero and
copied when growing. What is the purpose? Removing it would allow the
smallest node to take up only 32 bytes with a fanout of 3, by eliminating
padding.
Oh, I didn't notice that. The chunk field was originally used when
redirecting the child pointer in the parent node from old to new
(grown) node. When redirecting the pointer, since the corresponding
chunk surely exists on the parent we can skip existence checks.
Currently we use RT_NODE_UPDATE_INNER() for that (see
RT_REPLACE_NODE()) but having a dedicated function to update the
existing chunk and child pointer might improve the performance. Or
reducing the node size by getting rid of the chunk field might be
better.
I see. IIUC from a brief re-reading of the code, saving that chunk would
only save us from re-loading "parent->shift" from L1 cache and shifting the
key. The cycles spent doing that seem small compared to the rest of the
work involved in growing a node. Expressions like "if (idx < 0) return
false;" return to an asserts-only variable, so in production builds, I
would hope that branch gets elided (I haven't checked).
I'm quite keen on making the smallest node padding-free, (since we don't
yet have path compression or lazy path expansion), and this seems the way
to get there.
--
John Naylor
EDB: http://www.enterprisedb.com
I wrote:
I see. IIUC from a brief re-reading of the code, saving that chunk would
only save us from re-loading "parent->shift" from L1 cache and shifting the
key. The cycles spent doing that seem small compared to the rest of the
work involved in growing a node. Expressions like "if (idx < 0) return
false;" return to an asserts-only variable, so in production builds, I
would hope that branch gets elided (I haven't checked).
On further reflection, this is completely false and I'm not sure what I was
thinking. However, for the update-inner case maybe we can assert that we
found a valid slot.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Jan 11, 2023 at 12:13 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
It looks no problem in terms of vacuum integration, although I've not
fully tested yet. TID store uses the radix tree as the main storage,
and with the template radix tree, the data types for shared and
non-shared will be different. TID store can have an union for the
radix tree and the structure would be like follows:
/* Storage for Tids */
union tree
{
local_radix_tree *local;
shared_radix_tree *shared;
};
We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago.
One idea to have a common data type without unused fields is to use
radix_tree as a base class. We cast it to radix_tree_shared or
radix_tree_local depending on the is_shared flag in radix_tree. For
instance, we could have something like the following (based on the
non-template version):
struct radix_tree
{
bool is_shared;
MemoryContext context;
};
typedef struct rt_shared
{
rt_handle handle;
uint32 magic;
/* Root node */
dsa_pointer root;
uint64 max_val;
uint64 num_keys;
/* need a lwlock */
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
} rt_shared;
struct radix_tree_shared
{
radix_tree rt;
rt_shared *shared;
dsa_area *area;
} radix_tree_shared;
struct radix_tree_local
{
radix_tree rt;
uint64 max_val;
uint64 num_keys;
rt_node *root;
/* used only when the radix tree is private */
MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
} radix_tree_local;
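To illustrate how a caller could use this, the dispatch would cast based on
the is_shared flag (rt_set_shared()/rt_set_local() are assumed names for the
two sets of functions, not existing ones):

static bool
rt_set_dispatch(radix_tree *rt, uint64 key, uint64 value)
{
    if (rt->is_shared)
        return rt_set_shared((radix_tree_shared *) rt, key, value);
    else
        return rt_set_local((radix_tree_local *) rt, key, value);
}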
In the functions of TID store, we need to call either local or shared
radix tree functions depending on whether TID store is shared or not.
We need if-branch for each key-value pair insertion, but I think it
would not be a big performance problem in TID store use cases, since
vacuum is an I/O intensive operation in many cases.
Also, the branch will be easily predicted. That was still true in earlier patches, but with many more branches and fatter code paths.
Overall, I think
there is no problem and I'll investigate it in depth.
Okay, great. If the separate-functions approach turns out to be ugly, we can always go back to the branching approach for shared memory. I think we'll want to keep this as a template overall, at least to allow different value types and to ease adding variable-length keys if someone finds a need.
I agree to keep this as a template. From the vacuum integration
perspective, it would be better if we can use a common data type for
shared and local. It makes sense to have different data types if the
radix trees have different value types.
Apart from that, I've been considering the lock support for shared
radix tree. As we discussed before, the current usage (i.e, only
parallel index vacuum) doesn't require locking support at all, so it
would be enough to have a single lock for simplicity.
Right, that should be enough for PG16.
If we want to
use the shared radix tree for other use cases such as the parallel
heap vacuum or the replacement of the hash table for shared buffers,
we would need better lock support.
For future parallel pruning, I still think a global lock is "probably" fine if the workers buffer in local arrays. Highly concurrent applications will need additional work, of course.
For example, if we want to support
Optimistic Lock Coupling[1],
Interesting, from the same authors!
+1
we would need to change not only the node
structure but also the logic. Which probably leads to widen the gap
between the code for non-shared and shared radix tree. In this case,
once we have a better radix tree optimized for shared case, perhaps we
can replace the templated shared radix tree with it. I'd like to hear
your opinion on this line.
I'm not in a position to speculate on how best to do scalable concurrency, much less how it should coexist with the local implementation. It's interesting that their "ROWEX" scheme gives up maintaining order in the linear nodes.
One review point I'll mention: Somehow I didn't notice there is no use for the "chunk" field in the rt_node type -- it's only set to zero and copied when growing. What is the purpose? Removing it would allow the smallest node to take up only 32 bytes with a fanout of 3, by eliminating padding.
Oh, I didn't notice that. The chunk field was originally used when
redirecting the child pointer in the parent node from old to new
(grown) node. When redirecting the pointer, since the corresponding
chunk surely exists on the parent we can skip existence checks.
Currently we use RT_NODE_UPDATE_INNER() for that (see
RT_REPLACE_NODE()) but having a dedicated function to update the
existing chunk and child pointer might improve the performance. Or
reducing the node size by getting rid of the chunk field might be
better.
I see. IIUC from a brief re-reading of the code, saving that chunk would only save us from re-loading "parent->shift" from L1 cache and shifting the key. The cycles spent doing that seem small compared to the rest of the work involved in growing a node. Expressions like "if (idx < 0) return false;" return to an asserts-only variable, so in production builds, I would hope that branch gets elided (I haven't checked).
I'm quite keen on making the smallest node padding-free, (since we don't yet have path compression or lazy path expansion), and this seems the way to get there.
Okay, let's get rid of that in the v18.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 12, 2023 at 12:44 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Jan 11, 2023 at 12:13 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I agree to keep this as a template.
Okay, I'll squash the previous patch and work on cleaning up the internals.
I'll keep the external APIs the same so that your work on vacuum
integration can be easily rebased on top of that, and we can work
independently.
From the vacuum integration
perspective, it would be better if we can use a common data type for
shared and local. It makes sense to have different data types if the
radix trees have different values types.
I agree it would be better, all else being equal. I have some further
thoughts below.
It looks no problem in terms of vacuum integration, although I've not
fully tested yet. TID store uses the radix tree as the main storage,
and with the template radix tree, the data types for shared and
non-shared will be different. TID store can have an union for the
radix tree and the structure would be like follows:
/* Storage for Tids */
union tree
{
local_radix_tree *local;
shared_radix_tree *shared;
};
We could possibly go back to using a common data type for this, but
with unused fields in each setting, as before. We would have to be more
careful of things like the 32-bit crash from a few weeks ago.
One idea to have a common data type without unused fields is to use
radix_tree as a base class. We cast it to radix_tree_shared or
radix_tree_local depending on the is_shared flag in radix_tree. For
instance, we would have something like (based on the non-template version):

struct radix_tree
{
    bool is_shared;
    MemoryContext context;
};
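For illustration, the casting pattern would end up looking something like the following; the function and type names here are made up for the example, not actual patch code:

typedef struct radix_tree
{
    bool            is_shared;
    MemoryContext   context;
} radix_tree;

typedef struct radix_tree_local
{
    radix_tree      common;     /* must be the first field */
    /* slab contexts and other backend-local state ... */
} radix_tree_local;

typedef struct radix_tree_shared
{
    radix_tree      common;     /* must be the first field */
    dsa_area       *dsa;
    /* pointer to the control object in shared memory, etc. ... */
} radix_tree_shared;

bool
rt_search(radix_tree *tree, uint64 key, uint64 *value_p)
{
    /* every entry point needs a branch and a downcast */
    if (tree->is_shared)
        return rt_search_shared((radix_tree_shared *) tree, key, value_p);
    else
        return rt_search_local((radix_tree_local *) tree, key, value_p);
}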
That could work in principle. My first impression is that just a memory context
is not much of a base class. Also, casts can creep into a large number of
places.
Another thought came to mind: I'm guessing the TID store is unusual --
meaning most uses of radix tree will only need one kind of memory
(local/shared). I could be wrong about that, and it _is_ a guess about the
future. If true, then it makes more sense that only code that needs both
memory kinds should be responsible for keeping them separate.
The template might be easier for future use cases if shared memory were
all-or-nothing, meaning either
- completely different functions and types depending on RT_SHMEM, or
- branches (like v8)
The union sounds like a good thing to try, but do whatever seems right.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Jan 12, 2023 at 5:21 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Jan 12, 2023 at 12:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Jan 11, 2023 at 12:13 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I agree to keep this as a template.
Okay, I'll squash the previous patch and work on cleaning up the internals. I'll keep the external APIs the same so that your work on vacuum integration can be easily rebased on top of that, and we can work independently.
Thanks!
From the vacuum integration
perspective, it would be better if we can use a common data type for
shared and local. It makes sense to have different data types if the
radix trees have different value types.

I agree it would be better, all else being equal. I have some further thoughts below.
There seems to be no problem in terms of vacuum integration, although I've not
fully tested it yet. TID store uses the radix tree as the main storage,
and with the template radix tree, the data types for shared and
non-shared will be different. TID store can have a union for the
radix tree, and the structure would be like the following:

/* Storage for Tids */
union tree
{
    local_radix_tree *local;
    shared_radix_tree *shared;
};

We could possibly go back to using a common data type for this, but with unused fields in each setting, as before. We would have to be more careful of things like the 32-bit crash from a few weeks ago.
One idea to have a common data type without unused fields is to use
radix_tree as a base class. We cast it to radix_tree_shared or
radix_tree_local depending on the is_shared flag in radix_tree. For
instance, we would have something like (based on the non-template version):

struct radix_tree
{
    bool is_shared;
    MemoryContext context;
};

That could work in principle. My first impression is that just a memory context is not much of a base class. Also, casts can creep into a large number of places.
Another thought came to mind: I'm guessing the TID store is unusual -- meaning most uses of radix tree will only need one kind of memory (local/shared). I could be wrong about that, and it _is_ a guess about the future. If true, then it makes more sense that only code that needs both memory kinds should be responsible for keeping them separate.
True.
The template might be easier for future use cases if shared memory were all-or-nothing, meaning either
- completely different functions and types depending on RT_SHMEM, or
- branches (like v8)

The union sounds like a good thing to try, but do whatever seems right.
I've implemented the idea of using a union. Let me share WIP code for
discussion. I've attached three patches that can be applied on top of
the v17-0009 patch. v17-0010 implements missing shared memory support
functions such as RT_DETACH and RT_GET_HANDLE, and some fixes.
The v17-0011 patch adds TidStore, and the v17-0012 patch is the vacuum
integration.
Overall, the TidStore implementation with the union idea doesn't look so
ugly to me. But I got many compiler warnings about unused radix tree
functions like:
tidstore.c:99:19: warning: 'shared_rt_delete' defined but not used
[-Wunused-function]
I'm not sure there is a convenient way to suppress these warnings, but
one idea is to have some macros to specify what operations are
enabled/declared.
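For example, a rough sketch of such opt-in macros -- RT_USE_SET and friends are hypothetical names, not something the template currently provides -- could look like:

/* In tidstore.c: only declare/define the operations we actually call */
#define RT_PREFIX shared_rt
#define RT_SHMEM
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_SET              /* emit shared_rt_set() */
#define RT_USE_SEARCH           /* emit shared_rt_search() */
/* no RT_USE_DELETE, so shared_rt_delete() is never emitted and cannot warn */
#include "lib/radixtree.h"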
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v17-0010-fix-shmem-support.patch
From 56a45a0731abc33b3894d0aa0de06869d894637b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 12 Jan 2023 23:22:22 +0900
Subject: [PATCH v17 10/12] fix shmem support
---
src/include/lib/radixtree.h | 87 ++++++++++++++++++++++++---
src/include/lib/radixtree_iter_impl.h | 4 ++
2 files changed, 82 insertions(+), 9 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 2b58a0cdf5..a2e2e7a190 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -100,6 +100,8 @@
#define RT_SEARCH RT_MAKE_NAME(search)
#ifdef RT_SHMEM
#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
#endif
#define RT_SET RT_MAKE_NAME(set)
#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
@@ -164,6 +166,9 @@
#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
#define RT_NODE RT_MAKE_NAME(node)
#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
@@ -194,9 +199,15 @@
typedef struct RT_RADIX_TREE RT_RADIX_TREE;
typedef struct RT_ITER RT_ITER;
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
#ifdef RT_SHMEM
RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
#else
RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
#endif
@@ -542,9 +553,19 @@ static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
[RT_NODE_KIND_256] = RT_CLASS_256,
};
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
/* A radix tree with nodes */
typedef struct RT_RADIX_TREE_CONTROL
{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
+
RT_PTR_ALLOC root;
uint64 max_val;
uint64 num_keys;
@@ -565,7 +586,6 @@ typedef struct RT_RADIX_TREE
#ifdef RT_SHMEM
dsa_area *dsa;
- dsa_pointer ctl_dp;
#else
MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
@@ -1311,6 +1331,9 @@ RT_CREATE(MemoryContext ctx)
{
RT_RADIX_TREE *tree;
MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
old_ctx = MemoryContextSwitchTo(ctx);
@@ -1319,8 +1342,10 @@ RT_CREATE(MemoryContext ctx)
#ifdef RT_SHMEM
tree->dsa = dsa;
- tree->ctl_dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
- tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, tree->ctl_dp);
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
#else
tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
@@ -1346,21 +1371,40 @@ RT_CREATE(MemoryContext ctx)
}
#ifdef RT_SHMEM
-RT_RADIX_TREE *
-RT_ATTACH(dsa_area *dsa, dsa_pointer dp)
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
{
RT_RADIX_TREE *tree;
+ dsa_pointer control;
/* XXX: memory context support */
tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
- tree->ctl_dp = dp;
- tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
/* XXX: do we need to set a callback on exit to detach dsa? */
return tree;
}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
#endif
/*
@@ -1370,8 +1414,15 @@ RT_SCOPE void
RT_FREE(RT_RADIX_TREE *tree)
{
#ifdef RT_SHMEM
- dsa_free(tree->dsa, tree->ctl_dp); // XXX
- dsa_detach(tree->dsa);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);
#else
pfree(tree->ctl);
@@ -1398,6 +1449,10 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
RT_PTR_ALLOC nodep;
RT_PTR_LOCAL node;
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
/* Empty tree, create the root */
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
RT_NEW_ROOT(tree, key);
@@ -1453,6 +1508,9 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
RT_PTR_LOCAL node;
int shift;
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
Assert(value_p != NULL);
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
@@ -1493,6 +1551,10 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
int level;
bool deleted;
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
return false;
@@ -1736,6 +1798,7 @@ RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
Size total = sizeof(RT_RADIX_TREE);
#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
total = dsa_get_total_size(tree->dsa);
#else
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
@@ -2085,10 +2148,14 @@ rt_dump(RT_RADIX_TREE *tree)
#undef VAR_NODE_HAS_FREE_SLOT
#undef FIXED_NODE_HAS_FREE_SLOT
#undef RT_SIZE_CLASS_COUNT
+#undef RT_RADIX_TREE_MAGIC
/* type declarations */
#undef RT_RADIX_TREE
#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
#undef RT_ITER
#undef RT_NODE
#undef RT_NODE_ITER
@@ -2118,6 +2185,8 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_CREATE
#undef RT_FREE
#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
#undef RT_SET
#undef RT_BEGIN_ITERATE
#undef RT_ITERATE_NEXT
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 09d2018dc0..fd00851732 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -12,6 +12,10 @@
#error node level must be either inner or leaf
#endif
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
bool found = false;
uint8 key_chunk;
--
2.31.1
v17-0011-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
From c9e8bb135bdfc555153f1e6b324968701f6a26a0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v17 11/12] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 587 ++++++++++++++++++
src/include/access/tidstore.h | 49 ++
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 34 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../test_tidstore/test_tidstore.control | 4 +
10 files changed, 727 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..4170d13b3c
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,587 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, Tids are encoded as a pair of 64-bit key and 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA area
+ * to tidstore_create(). Other backends can attach to the shared TidStore by
+ * tidstore_attach(). It can support concurrent updates but only one process
+ * is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of 64-bit
+ * key and 64-bit value. First, we construct 64-bit unsigned integer key that
+ * combines the block number and the offset number. The lowest 11 bits represent
+ * the offset number, and the next 32 bits are the block number. That is, only 43
+ * bits are used:
+ *
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys could help keep the radix tree shallow.
+ *
+ * XXX: If we want to support other table AMs that want to use the full range
+ * of possible offset numbers, we'll need to change this.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits, and
+ * the remaining 37 bits are used as the key:
+ *
+ * value = bitmap representation of XXXXXX
+ * key = XXXXXYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYuu
+ *
+ * The maximum height of the radix tree is 5.
+ *
+ * XXX: if we want to support non-heap table AM, we need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.
+ */
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6
+
+/*
+ * Memory consumption depends on the number of Tids stored, but also on the
+ * distribution of them and how the radix tree stores them. The maximum bytes
+ * that a TidStore can use is specified by the max_bytes argument of tidstore_create().
+ *
+ * In non-shared cases, the radix tree uses a slab allocator for each kind of
+ * node class. The most memory consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is that we allocate the
+ * largest radix tree node in a new slab block, which is approximately 70kB.
+ * Therefore, we deduct 70kB from the maximum bytes.
+ *
+ * In shared cases, DSA allocates memory segments big enough to follow
+ * a geometric series that approximately doubles the total DSA size. So we
+ * limit the maximum bytes for a TidStore to 75%. The 75% threshold works
+ * perfectly in cases where the maximum bytes is a power of 2. In other cases,
+ * we use a 60% threshold.
+ */
+#define TIDSTORE_MEMORY_DEDUCT_BYTES (1024L * 70) /* 70kB */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ /*
+ * 'num_tids' is the number of Tids stored so far. 'max_bytes' is the maximum
+ * bytes a TidStore can use. These two fields are commonly used in both
+ * non-shared case and shared case.
+ */
+ uint32 num_tids;
+ uint64 max_bytes;
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * We calculate the maximum bytes for the TidStore in different ways
+ * for the non-shared case and the shared case. Please refer to the comment
+ * above TIDSTORE_MEMORY_DEDUCT_BYTES for details.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - TIDSTORE_MEMORY_DEDUCT_BYTES;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backends must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ /*
+ * Free the current radix tree, and return allocated DSM segments
+ * to the operating system, if necessary. */
+ if (TidStoreIsShared(ts))
+ {
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, val);
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/*
+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are in ascending order.
+ */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ /* insert the key-value */
+ tidstore_insert_kv(ts, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ /* insert the key-value */
+ tidstore_insert_kv(ts, last_key, val);
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+}
+
+/* Return true if the given Tid is present in TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = TidStoreIsShared(ts) ?
+ shared_rt_search(ts->tree.shared, key, &val) :
+ local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+ iter->result.blkno = InvalidBlockNumber;
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to do */
+ if (ts->control->num_tids == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+uint64
+tidstore_num_tids(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+uint64
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return (uint64) sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return (uint64) sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a Tid into a key, and set *off to the bit position within the value.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..4bffdf0920
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually don't use up */
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern uint64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern uint64 tidstore_max_memory(TidStore *ts);
+extern uint64 tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..1973963440
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..3365b073a4
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
v17-0012-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
From 6f5a52a3bd7c018b42cbd7db1f9cad47d378c816 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 12 Jan 2023 22:04:20 +0900
Subject: [PATCH v17 12/12] Use TIDStore for storing dead tuple TID during lazy
vacuum.
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which is not space efficient and is slow to look up. Also, we had
a 1GB limit on its size.
This change uses TIDStore for this purpose. Since the TIDStore,
backed by the radix tree, incrementally allocates memory, we get
rid of the 1GB limit.
Also, since we can no longer exactly estimate the maximum number of
TIDs that can be stored based on the amount of memory, the progress
columns max_dead_tuples and num_dead_tuples are renamed and now report
the progress information in bytes.
Furthermore, since TIDStore uses the radix tree internally, the
minimum amount of memory required by TIDStore is 1MB, which is the
initial DSA segment size. Due to that, this change increases the minimum
maintenance_work_mem from 1MB to 2MB.
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 169 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +--------
src/backend/commands/vacuumparallel.c | 64 +++++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
15 files changed, 122 insertions(+), 243 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 358d2ff90f..6ce7ea9e35 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6840,10 +6840,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6851,10 +6851,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a42e881da3..1041e6640f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -259,8 +260,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +827,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +908,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1039,11 +1040,18 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1080,7 +1088,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1233,7 +1241,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1871,23 +1879,15 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
vacrel->lpdead_item_pages++;
prunestate->has_lpdead_items = true;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -2107,8 +2107,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2117,17 +2116,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2176,7 +2168,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2205,7 +2197,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2232,8 +2224,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2278,7 +2270,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2351,7 +2343,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2388,10 +2380,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index;
BlockNumber vacuumed_pages;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2408,8 +2401,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuumed_pages = 0;
- index = 0;
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber tblk;
Buffer buf;
@@ -2418,12 +2411,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- tblk = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ tblk = result->blkno;
vacrel->blkno = tblk;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, tblk, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, tblk, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, tblk, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2433,6 +2427,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, tblk, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
/* Clear the block number information */
vacrel->blkno = InvalidBlockNumber;
@@ -2447,14 +2442,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2471,11 +2465,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int uncnt = 0;
@@ -2494,16 +2487,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2583,7 +2571,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3079,46 +3066,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3129,11 +3076,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3160,7 +3105,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3173,11 +3118,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 447c9b970f..133e03d728 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c4ed7efce3..7de4350cde 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2298,16 +2297,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2338,18 +2337,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2360,60 +2347,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..4c0ce4b7e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 196bece0a3..ff75fae88a 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -186,6 +186,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 92545b4958..3f8a5bc582 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2301,7 +2301,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..220d89fff7 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e4162db613..40dda03088 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -204,6 +204,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 542c2e098c..e678e6f79e 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -524,7 +524,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..0c49354f04 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT s.stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index 6cb9c926c0..a795d705d5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -256,7 +256,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
On Fri, Dec 23, 2022 at 4:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Dec 22, 2022 at 10:00 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
If the value is a power of 2, it seems to work perfectly fine. But for
example, if it's 700MB, the total memory exceeds the limit:

2*(1+2+4+8+16+32+64+128) = 510MB (72.8% of 700MB) -> keep going
510 + 256 = 766MB -> stop, but it exceeds the limit.

In a bigger case, if it's 11000MB:

2*(1+2+...+2048) = 8190MB (74.4%)
8190 + 4096 = 12286MB

That being said, I don't think these are common cases. So the 75%
threshold seems to work fine in most cases.

Thinking some more, I agree this doesn't have large practical risk, but thinking from the point of view of the community, being loose with memory limits by up to 10% is not a good precedent.
Agreed.
Perhaps we can be clever and use 75% when the limit is a power of two and 50% otherwise.

I'm skeptical of trying to be clever, and I just thought of an additional concern: We're assuming behavior of the growth in size of new DSA segments, which could possibly change. Given how allocators are typically coded, though, it seems safe to assume that they'll at most double in size.
Sounds good to me.
I've written a simple script to simulate the DSA memory usage and the
limit. The 75% limit works fine for power-of-two cases, and we can
use the 60% limit for other cases (it seems we can use up to about 66%,
but I used 60% for safety). It would be best if we could prove it
mathematically, but I could prove only the power-of-two cases. The script,
however, practically shows that the 60% threshold works for these cases.
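For illustration, here is a minimal, self-contained C sketch of that kind of
simulation. It assumes DSA creates two segments at each size, starting at 1MB
and doubling up to a 1GB cap (assumptions drawn from the arithmetic above,
not guarantees about DSA internals), and keeps allocating while usage is
below the threshold:

#include <stdio.h>
#include <stdint.h>

/*
 * Keep "allocating" segments while total usage is below
 * stop_fraction * limit, then report the peak usage.  Segment sizes
 * (1MB start, two segments per size, doubling, 1GB cap) are assumptions.
 */
static void
simulate(uint64_t limit_mb, double stop_fraction)
{
	uint64_t	seg_mb = 1;
	uint64_t	total_mb = 0;
	int			nsame = 0;

	while (total_mb < (uint64_t) (limit_mb * stop_fraction))
	{
		total_mb += seg_mb;
		if (++nsame == 2 && seg_mb < 1024)
		{
			seg_mb *= 2;
			nsame = 0;
		}
	}

	printf("limit %luMB, stop at %.0f%%: peak %luMB (%.1f%% of limit)\n",
		   (unsigned long) limit_mb, stop_fraction * 100,
		   (unsigned long) total_mb, 100.0 * total_mb / limit_mb);
}

int
main(void)
{
	simulate(700, 0.75);	/* overshoots: peak 766MB */
	simulate(700, 0.60);	/* stays under: peak 510MB */
	simulate(11000, 0.60);	/* stays under: peak 8190MB */
	return 0;
}

With these assumptions the 75% threshold overshoots a 700MB limit (peak
766MB), while the 60% threshold stays within both the 700MB and 11000MB
limits, matching the numbers above.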
Regards
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
On Thu, Jan 12, 2023 at 9:51 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Thu, Jan 12, 2023 at 5:21 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

Okay, I'll squash the previous patch and work on cleaning up the
internals. I'll keep the external APIs the same so that your work on vacuum
integration can be easily rebased on top of that, and we can work
independently.
There were some conflicts with HEAD, so to keep the CF bot busy, I've
quickly put together v18. I still have a lot of cleanup work to do, but
this is enough for now.
0003 contains all v17 local-memory coding squashed together.
0004 perf test not updated but it doesn't build by default so it's fine for
now
0005 removes node.chunk as discussed, but does not change node4 fanout yet.
0006 is a small cleanup regarding setting node fanout.
0007 squashes my shared memory work with Masahiko's fixes from the addendum
v17-0010.
0008 turns the existence checks in RT_NODE_UPDATE_INNER into Asserts, as
discussed.
0009/0010 are just copies of Masahiko's v17 addendum v17-0011/12, but the
latter rebased over recent variable renaming (it's possible I missed
something, so worth checking).
I've implemented the idea of using a union. Let me share the WIP code for
discussion; I've attached three patches that can be applied on top of
Seems fine as far as the union goes. Let's go ahead with this, and make
progress on locking etc.
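To make the idea concrete for readers, here is a rough, hypothetical sketch
of what a union-backed TidStore could look like. All type and field names
below are illustrative and are not taken from the attached patches:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Opaque stand-ins for the templated radix tree variants (illustrative). */
struct local_rt_radix_tree;
struct shared_rt_radix_tree;

/*
 * Hypothetical TidStore whose storage is either a backend-local radix
 * tree or a DSA-backed shared one, selected by a flag and accessed
 * through a union.
 */
typedef struct TidStore
{
	bool		is_shared;		/* which union member is valid */
	int64_t		num_tids;		/* number of TIDs stored so far */
	size_t		max_bytes;		/* memory budget for this store */

	union
	{
		struct local_rt_radix_tree *local;
		struct shared_rt_radix_tree *shared;
	}			tree;
} TidStore;

Call sites would then branch on is_shared (or hide that behind small wrapper
functions) when setting or looking up TIDs.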
Overall, the TidStore implementation with the union idea doesn't look so
ugly to me. But I got many compiler warnings about unused radix tree
functions, like:

tidstore.c:99:19: warning: 'shared_rt_delete' defined but not used
[-Wunused-function]

I'm not sure there is a convenient way to suppress this warning, but
one idea is to have some macros to specify which operations are
enabled/declared.
That sounds like a good idea. It's also worth wondering if we even need
RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that
if necessary. It's also misnamed, since it's concerned with the number of
keys. The vacuum case cares about the number of TIDs, and not number of
(encoded) keys. Even if we ever (say) changed the key to blocknumber and
value to Bitmapset, the number of keys might not be interesting. It sounds
like we should at least make the delete functionality optional. (Side note
on optional functions: if an implementation didn't care about iteration or
its order, we could optimize insertion into linear nodes)
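As a small, self-contained illustration of that idea (the RT_USE_DELETE
switch name below is hypothetical, not from the patches), an opt-in macro can
keep an unused static function from being compiled at all, which also avoids
the -Wunused-function warning:

#include <stdio.h>

#define RT_USE_DELETE			/* comment this out to drop rt_delete() */

static void
rt_set(int key)
{
	printf("set %d\n", key);
}

#ifdef RT_USE_DELETE
static void
rt_delete(int key)
{
	printf("delete %d\n", key);
}
#endif

int
main(void)
{
	rt_set(42);
#ifdef RT_USE_DELETE
	rt_delete(42);
#endif
	return 0;
}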
Since this is WIP, you may already have some polish in mind, so I won't go
over the patches in detail, but I wanted to ask about a few things (numbers
referring to v17 addendum, not v18):
0011
+ * 'num_tids' is the number of Tids stored so far. 'max_byte' is the maximum
+ * bytes a TidStore can use. These two fields are commonly used in both
+ * non-shared case and shared case.
+ */
+ uint32 num_tids;
uint32 is how we store the block number, so this is too small and will wrap
around on overflow. int64 seems better.
+ * We calculate the maximum bytes for the TidStore in different ways
+ * for non-shared case and shared case. Please refer to the comment
+ * TIDSTORE_MEMORY_DEDUCT for details.
+ */
Maybe the #define and comment should be close to here.
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backend must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
If not addressed by next patch, need to phrase comment with FIXME or TODO
about making certain.
+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are ordered in ascending order.
Must? What happens otherwise?
+ uint64 last_key = PG_UINT64_MAX;
I'm having some difficulty understanding this sentinel and how it's used.
@@ -1039,11 +1040,18 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
This part only runs "if (vacrel->nindexes == 0)", so it seems like unneeded
complexity. It arises because lazy_scan_prune() populates the tid store
even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune()
could pass the deadoffsets array, and upon returning, either populate the
store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm
missing some detail, so some description of the design choices made would
be helpful.
On Mon, Jan 16, 2023 at 9:53 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I've written a simple script to simulate the DSA memory usage and the
limit. The 75% limit works fine for power-of-two cases, and we can
use the 60% limit for other cases (it seems we can use up to about 66%,
but I used 60% for safety). It would be best if we could prove it
mathematically, but I could prove only the power-of-two cases. The script,
however, practically shows that the 60% threshold works for these cases.
Okay. It's worth highlighting this in the comments, and also the fact that
it depends on internal details of how DSA increases segment size.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v18-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From c67b1955c95a036f93baaea8f43dcf49fa6e86f8 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v18 02/10] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bafec5f7..5bd3da4948 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.0
v18-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From b7935edac9046631ee9fca095bd8b3901cc5629b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v18 01/10] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..84d41a340a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.0
v18-0005-Remove-chunk-from-the-common-node-type.patch
From 4a385a0667e2489e6b4b850c2f7699049d652811 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 12 Jan 2023 20:32:06 +0700
Subject: [PATCH v18 05/10] Remove chunk from the common node type
This enabled a possible optimization for updating
the parent node's child pointer during node growth.
This is not likely to buy us much, and removing it
reduces the common type size to 5 bytes.
TODO: Reducing the smallest node to 3 members will
eliminate padding and only take up 32 bytes for
inner nodes.
---
src/include/lib/radixtree.h | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index b3d84da033..72735c4643 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -295,7 +295,6 @@ typedef struct RT_NODE
* RT_NODE_SPAN bits are then represented in chunk.
*/
uint8 shift;
- uint8 chunk;
/* Node kind, one per search/set algorithm */
uint8 kind;
@@ -964,7 +963,6 @@ static inline void
RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
{
newnode->shift = oldnode->shift;
- newnode->chunk = oldnode->chunk;
newnode->count = oldnode->count;
}
@@ -1026,7 +1024,6 @@ static void
RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
RT_PTR_ALLOC new_child, uint64 key)
{
- Assert(old_child->chunk == new_child->chunk);
Assert(old_child->shift == new_child->shift);
if (parent == old_child)
@@ -1074,8 +1071,8 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
n4->base.chunks[0] = 0;
n4->children[0] = tree->root;
- tree->root->chunk = 0;
- tree->root = node;
+ /* Update the root */
+ tree->ctl->root = allocnode;
shift += RT_NODE_SPAN;
}
@@ -1104,8 +1101,7 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent
newchild = (RT_PTR_LOCAL) allocchild;
RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newchild->shift = newshift;
- newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
- RT_NODE_INSERT_INNER(tree, parent, node, key, newchild);
+ RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
parent = node;
node = newchild;
@@ -1684,13 +1680,13 @@ rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
{
char space[125] = {0};
- fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
(node->kind == RT_NODE_KIND_4) ? 4 :
(node->kind == RT_NODE_KIND_32) ? 32 :
(node->kind == RT_NODE_KIND_125) ? 125 : 256,
node->fanout == 0 ? 256 : node->fanout,
- node->count, node->shift, node->chunk);
+ node->count, node->shift);
if (level > 0)
sprintf(space, "%*c", level * 4, ' ');
--
2.39.0
v18-0004-tool-for-measuring-radix-tree-performance.patch
From 3c7efecad8161974b7168b8a325ce1ae985774fd Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v18 04/10] tool for measuring radix tree performance
XXX: Not for commit
TODO: adjust for templating
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 +++
contrib/bench_radix_tree/bench_radix_tree.c | 635 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
6 files changed, 767 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..a0693695e6
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,635 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation*/
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
--
2.39.0
v18-0003-Add-radixtree-template.patch
From 49a7e1f26c28668a35d96b3533ca59d88119a251 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v18 03/10] Add radixtree template
The only thing configurable at this point is function scope
and prefix, since the point is to see if this makes a shared
memory implementation clear and maintainable.
The key and value types are still hard-coded to uint64.
To make this more useful, at least the value type should be
configurable.
It might be good at some point to offer a different tree type,
e.g. "single-value leaves" to allow for variable length keys
and values, giving full flexibility to developers.
---
src/include/lib/radixtree.h | 2018 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 100 +
src/include/lib/radixtree_insert_impl.h | 293 +++
src/include/lib/radixtree_iter_impl.h | 129 ++
src/include/lib/radixtree_search_impl.h | 102 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 588 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
18 files changed, 3367 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..b3d84da033
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2018 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.c
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression and lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree level
+ * to be high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner tree nodes
+ * (shift > 0) store the pointer to a child node as the value, while leaf nodes
+ * (shift == 0) store the 64-bit unsigned integer that is specified by the user as
+ * the value. The paper refers to this technique as "Multi-value leaves". We
+ * choose it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is some duplicated code. While this sometimes makes code
+ * maintenance tricky, it reduces branch prediction misses when judging
+ * whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ *
+ * Optional parameters:
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_DELETE - Delete a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITER - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ * RT_NUM_ENTRIES - Get the number of key-value pairs
+ *
+ * RT_CREATE() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for each kind of radix tree node under it.
+ *
+ * RT_ITERATE_NEXT() returns key-value pairs in ascending order of the key.
+ *
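+ * A minimal usage sketch (the 'foo' prefix, the 'static inline' scope, and
+ * the use of CurrentMemoryContext below are only an example):
+ *
+ *	#define RT_PREFIX foo
+ *	#define RT_SCOPE static inline
+ *	#define RT_DECLARE
+ *	#define RT_DEFINE
+ *	#include "lib/radixtree.h"
+ *
+ *	foo_radix_tree *tree;
+ *	uint64		key = 42;
+ *	uint64		value = 1;
+ *
+ *	tree = foo_create(CurrentMemoryContext);
+ *	foo_set(tree, key, value);
+ *	if (foo_search(tree, key, &value))
+ *		elog(NOTICE, "found " UINT64_FORMAT, value);
+ *	foo_free(tree);
+ *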
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/pg_lfind.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#define RT_DELETE RT_MAKE_NAME(delete)
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#define RT_NUM_ENTRIES RT_MAKE_NAME(num_entries)
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_ITER RT_MAKE_NAME(iter)
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
+#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
+#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+RT_SCOPE uint64 RT_NUM_ENTRIES(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* macros and types common to all implementations */
+#ifndef RT_COMMON
+#define RT_COMMON
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Maximum number of levels in the radix tree */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds, and each node kind has one or two size classes,
+ * partial and full. The size classes of the same node kind share the same
+ * node structure but have a different fanout, which is stored in the 'fanout'
+ * field of RT_NODE. For example, in the size class with fanout 15, when a
+ * 16th element is to be inserted we allocate a larger area and memcpy the
+ * entire old node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding for both inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+#endif /* RT_COMMON */
+
+
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+	 * Number of children. We use uint16 to be able to represent a full
+	 * node with 256 children, since a uint8 can count only up to 255.
+ */
+ uint16 count;
+
+ /* Max number of children. We can use uint8 because we never need to store 256 */
+ /* WIP: if we don't have a variable sized node4, this should instead be in the base
+ types as needed, since saving every byte is crucial for the smallest node kind */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+ uint8 chunk;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+
+#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
+
+/*
+ * Base types of each node kind, for leaf and inner nodes.
+ *
+ * The base types must be able to accommodate the largest size class for
+ * variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_4
+{
+ RT_NODE n;
+
+	/* key chunks for up to 4 children or values */
+ uint8 chunks[4];
+} RT_NODE_BASE_4;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+	/* key chunks for up to 32 children or values */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses the slot_idxs array, an array of length RT_NODE_MAX_SLOTS
+ * (256), to store indexes into a second array that contains up to 125 values
+ * (or child pointers in inner nodes).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+	/* For each chunk, the index of its slot in the children/values array */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(128)];
+} RT_NODE_BASE_125;
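+
+/*
+ * For example (an illustrative sketch): if chunk 0x21 currently maps to
+ * slot 3, then slot_idxs[0x21] == 3, bit 3 of 'isset' is set, and the entry
+ * itself lives in values[3] of a leaf node (or children[3] of an inner node).
+ */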
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different from something fitting into a
+ *    pointer-width type
+ * 2) we need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * strong a reason. It might be better to just indicate non-existing entries
+ * the same way in inner nodes.
+ */
+typedef struct RT_NODE_INNER_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_4;
+
+typedef struct RT_NODE_LEAF_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_4;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node kind. It has an array of length
+ * RT_NODE_MAX_SLOTS for directly storing values (or child pointers in
+ * inner nodes).
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} RT_SIZE_CLASS_ELEM;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
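+
+/*
+ * For example (assuming SLAB_DEFAULT_BLOCK_SIZE is 8kB): for a hypothetical
+ * chunk size of 300 bytes, (8192 / 300) * 300 = 8100 bytes would hold only
+ * 27 chunks, so the Max() picks 300 * 32 = 9600 bytes instead.
+ */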
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+/* Map from the node kind to its minimum size class */
+static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+/* A radix tree with nodes */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating over the radix tree returns each key-value pair in ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * The RT_NODE_ITER struct tracks the iteration within a single node.
+ *
+ * RT_ITER is the struct for iterating over the whole radix tree, and it uses
+ * one RT_NODE_ITER per level. During the iteration we also construct the key
+ * whenever the node iteration information is updated, e.g., when advancing
+ * the current index within a node or when moving to the next node at the
+ * same level.
+ */
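+
+/*
+ * A minimal iteration sketch, assuming an instantiation with the 'foo'
+ * prefix as in the example near the top of this file:
+ *
+ *	foo_iter   *iter = foo_begin_iterate(tree);
+ *	uint64		key;
+ *	uint64		value;
+ *
+ *	while (foo_iterate_next(iter, &key, &value))
+ *		elog(NOTICE, UINT64_FORMAT " -> " UINT64_FORMAT, key, value);
+ *	foo_end_iterate(iter);
+ */
+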
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_LOCAL child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+ uint64 key, uint64 value);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'.
+ * Return -1 if there is no such chunk.
+ */
+static inline int
+RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the position at which 'chunk' should be inserted in the given node.
+ */
+static inline int
+RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'.
+ * Return -1 if there is no such chunk.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the position at which 'chunk' should be inserted in the given node.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements at and after 'idx' to the right by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+	memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return (node->children[chunk] != NULL);
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child (or value) at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = NULL;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the shift needed for a node that can store the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key value that can be stored under a node with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
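+
+/*
+ * A worked example of the two helpers above: key 0x10000 has its leftmost
+ * one bit at position 16, so RT_KEY_GET_SHIFT returns (16 / 8) * 8 = 16, and
+ * RT_SHIFT_GET_MAX_VAL(16) returns (1 << 24) - 1; a root node with shift 16
+ * therefore covers keys up to 0xFFFFFF.
+ */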
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
+{
+ RT_PTR_ALLOC newnode;
+
+ if (inner)
+ newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ RT_SIZE_CLASS_INFO[size_class].inner_size);
+ else
+ newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ RT_SIZE_CLASS_INFO[size_class].leaf_size);
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->cnt[size_class]++;
+#endif
+
+ return newnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+
+ node->kind = kind;
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+
+ /*
+	 * Technically the fanout is 256, but we cannot store that in a uint8,
+	 * and since this is the largest size class the node will never grow.
+ */
+ if (kind == RT_NODE_KIND_256)
+ node->fanout = 0;
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool inner = shift > 0;
+ RT_PTR_ALLOC newnode;
+
+ newnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->root = newnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->chunk = oldnode->chunk;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count as 'node'.
+ */
+static RT_NODE*
+RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
+{
+ RT_PTR_ALLOC newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ newnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->root == node)
+ {
+ tree->root = NULL;
+ tree->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->cnt[i]--;
+ Assert(tree->cnt[i] >= 0);
+ }
+#endif
+
+ pfree(node);
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+ Assert(old_child->chunk == new_child->chunk);
+ Assert(old_child->shift == new_child->shift);
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new large node */
+ tree->root = new_child;
+ }
+ else
+ {
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
+
+ replaced = RT_NODE_INSERT_INNER(tree, NULL, parent, key, new_child);
+ Assert(replaced);
+ }
+
+ RT_FREE_NODE(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height to store the key. Extend the
+ * radix tree by adding new root nodes until it does.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ int shift = tree->root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_4 *n4;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ node = (RT_PTR_LOCAL) allocnode;
+ RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->shift = shift;
+ node->count = 1;
+
+ n4 = (RT_NODE_INNER_4 *) node;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->root;
+
+ tree->root->chunk = 0;
+ tree->root = node;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have the inner and leaf nodes for the given
+ * key-value pair. Create them, descending from 'node' down to the bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+ RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newchild = (RT_PTR_LOCAL) allocchild;
+ RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ newchild->chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ RT_NODE_INSERT_INNER(tree, parent, node, key, newchild);
+
+ parent = node;
+ node = newchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
+ tree->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is stored in *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Delete the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key was found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Delete the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key was found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/* Insert the child to the inner node */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node, uint64 key,
+ RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Insert the value to the leaf node */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+ uint64 key, uint64 value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+RT_CREATE(MemoryContext ctx)
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = palloc(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+ tree->root = NULL;
+ tree->max_val = 0;
+ tree->num_keys = 0;
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size);
+#ifdef RT_DEBUG
+ tree->cnt[i] = 0;
+#endif
+ }
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+
+ pfree(tree);
+}
+
+/*
+ * Set 'key' to 'value'. If the entry already exists, update its value to
+ * 'value' and return true; otherwise insert it and return false.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL node;
+ RT_PTR_LOCAL parent;
+
+ /* Empty tree, create the root */
+ if (!tree->root)
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->max_val)
+ RT_EXTEND(tree, key);
+
+ Assert(tree->root);
+
+ shift = tree->root->shift;
+ node = parent = tree->root;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_LOCAL child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_SET_EXTEND(tree, key, value, parent, node);
+ return false;
+ }
+
+ parent = node;
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is
+ * found, otherwise return false. On success, the value is stored in
+ * *value_p, so value_p must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+
+ Assert(value_p != NULL);
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ node = tree->root;
+ shift = tree->root->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+ if (!tree->root || key > tree->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ node = tree->root;
+ shift = tree->root->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child;
+
+ /* Push the current node to the stack */
+ stack[++level] = node;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ Assert(NODE_IS_LEAF(node));
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, node);
+
+	/* Delete the key from the inner nodes, walking up the stack */
+ while (level >= 0)
+ {
+ node = stack[level--];
+
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, node);
+ }
+
+ return true;
+}
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise return NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and store the
+ * value in *value_p, otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+	/* empty tree; restore the memory context before returning */
+	if (!iter->tree->root)
+	{
+		MemoryContextSwitchTo(old_ctx);
+		return iter;
+	}
+
+ top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is
+	 * constructed while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, iter->tree->root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key; otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner
+		 * node iterators from level 1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+RT_SCOPE uint64
+RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
+{
+ return tree->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = sizeof(RT_RADIX_TREE);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+				/* Check if the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->num_keys,
+ tree->root->shift / RT_NODE_SPAN,
+ tree->cnt[RT_CLASS_4_FULL],
+ tree->cnt[RT_CLASS_32_PARTIAL],
+ tree->cnt[RT_CLASS_32_FULL],
+ tree->cnt[RT_CLASS_125_FULL],
+ tree->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u, chunk 0x%X:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift, node->chunk);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(128); i++)
+ {
+ fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ }
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->max_val, tree->max_val);
+
+ if (!tree->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->root;
+ shift = tree->root->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_LOCAL child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_size,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+
+ if (!tree->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->root, 0, true);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+
+/* locally declared macros */
+#undef NODE_IS_LEAF
+#undef NODE_IS_EMPTY
+#undef VAR_NODE_HAS_FREE_SLOT
+#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_SIZE_CLASS_COUNT
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_4_FULL
+#undef RT_CLASS_32_PARTIAL
+#undef RT_CLASS_32_FULL
+#undef RT_CLASS_125_FULL
+#undef RT_CLASS_256
+#undef RT_KIND_MIN_SIZE_CLASS
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_NUM_ENTRIES
+#undef RT_DUMP
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_GROW_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..6eefc63e19
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,100 @@
+/* TODO: shrink nodes */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..ff76583402
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,293 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ RT_NODE *newnode = NULL;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[idx] = value;
+#else
+ n4->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ RT_NODE32_TYPE *new32;
+
+ /* grow node from 4 to 32 */
+ newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ new32 = (RT_NODE32_TYPE *) newnode;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+#endif
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
+ int count = n4->base.n.count;
+
+				/* shift chunks and children (or values) */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
+ count, insertpos);
+#endif
+ }
+
+ n4->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[insertpos] = value;
+#else
+ n4->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.fanout == class32_min.fanout)
+ {
+ /* grow to the next size class of this kind */
+#ifdef RT_NODE_LEVEL_LEAF
+ newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, false);
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, true);
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (RT_NODE32_TYPE *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ RT_NODE125_TYPE *new125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_NODE_125_INVALID_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ RT_NODE256_TYPE *new256;
+
+ /* grow node from 125 to 256 */
+ newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ new256 = (RT_NODE256_TYPE *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(128); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..a153011376
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,129 @@
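+/*
+ * radixtree_iter_impl.h
+ *
+ * Code fragment #include'd (with RT_NODE_LEVEL_INNER or RT_NODE_LEVEL_LEAF
+ * defined) into the per-node iteration routines of the radix tree template.
+ * It advances node_iter->current_idx to the next used slot of the current
+ * node and, if one is found, updates the iterator key and returns the child
+ * pointer (inner case) or stores the value into *value_p and returns true
+ * (leaf case).
+ */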
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value;
+#else
+ RT_NODE *child = NULL;
+#endif
+ bool found = false;
+ uint8 key_chunk;
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[node_iter->current_idx];
+#else
+ child = n4->children[node_iter->current_idx];
+#endif
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = n32->children[node_iter->current_idx];
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_NODE_INNER_125_GET_CHILD(n125, i);
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_NODE_INNER_256_GET_CHILD(n256, i);
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..cbc357dcc8
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,102 @@
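+/*
+ * radixtree_search_impl.h
+ *
+ * Code fragment #include'd (with RT_NODE_LEVEL_INNER or RT_NODE_LEVEL_LEAF
+ * defined) into the node-search routines of the radix tree template. It
+ * looks up the chunk of 'key' in 'node' and, if present, stores the value
+ * into *value_p (leaf case) or the child pointer into *child_p (inner case)
+ * and returns true; otherwise it returns false.
+ */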
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value = 0;
+#else
+ RT_PTR_LOCAL child = NULL;
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[idx];
+#else
+ child = n4->children[idx];
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+
+ if (!RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ *value_p = value;
+#else
+ Assert(child_p != NULL);
+ *child_p = child;
+#endif
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..2256d08100
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,588 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the tests, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
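+
+/*
+ * For example (illustration only), a spec with pattern_str "0101" and
+ * spacing 10 sets the integers 1 and 3 in the first repetition, 11 and 13
+ * in the second, and so on: each '1' contributes its index within the
+ * pattern, and the pattern repeats every 'spacing' integers.
+ */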
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /* prepare keys in an order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check that the keys in [start, end) with the given shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+ radixtree = rt_create(CurrentMemoryContext);
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+ radixtree = rt_create(radixtree_ctx);
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.0
Attachment: v18-0006-Clarify-coding-around-fanout.patch (application/x-patch)
From f0bac77d49a88c82f4725bed5688d5f1e01dbe49 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 12 Jan 2023 20:39:19 +0700
Subject: [PATCH v18 06/10] Clarify coding around fanout
Change assignment of node256's fanout to an
assert and add some comments to the fanout
member of the RT_NODE struct.
---
src/include/lib/radixtree.h | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 72735c4643..a02e835cd6 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -284,9 +284,15 @@ typedef struct RT_NODE
*/
uint16 count;
- /* Max number of children. We can use uint8 because we never need to store 256 */
- /* WIP: if we don't have a variable sized node4, this should instead be in the base
- types as needed, since saving every byte is crucial for the smallest node kind */
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
uint8 fanout;
/*
@@ -923,7 +929,12 @@ RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner
MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
node->kind = kind;
- node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
/* Initialize slot_idxs to invalid values */
if (kind == RT_NODE_KIND_125)
@@ -932,13 +943,6 @@ RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner
memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
}
-
- /*
- * Technically it's 256, but we cannot store that in a uint8,
- * and this is the max size class to it will never grow.
- */
- if (kind == RT_NODE_KIND_256)
- node->fanout = 0;
}
/*
--
2.39.0
Attachment: v18-0009-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (application/x-patch)
From 9ac5e8839bdf57eeaf357d3f1406b288c022edab Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v18 09/10] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 587 ++++++++++++++++++
src/include/access/tidstore.h | 49 ++
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 34 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../test_tidstore/test_tidstore.control | 4 +
10 files changed, 727 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..4170d13b3c
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,587 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a Tid is encoded as a pair of 64-bit key and 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * with tidstore_attach(). It supports concurrent updates, but only one process
+ * is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
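+
+/*
+ * Typical usage (an illustrative sketch only; 'max_bytes', 'blkno',
+ * 'offsets', 'tid' and the iteration variables are caller-side names):
+ *
+ *   ts = tidstore_create(max_bytes, NULL);   (NULL means backend-local)
+ *   tidstore_add_tids(ts, blkno, offsets, num_offsets);
+ *   ...
+ *   if (tidstore_lookup_tid(ts, &tid))
+ *       ... the Tid has been stored ...
+ *
+ *   iter = tidstore_begin_iterate(ts);
+ *   while ((result = tidstore_iterate_next(iter)) != NULL)
+ *       ... process result->blkno and result->offsets ...
+ *   tidstore_end_iterate(iter);
+ *
+ *   tidstore_destroy(ts);
+ */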
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of 64-bit
+ * key and 64-bit value. First, we construct a 64-bit unsigned integer that
+ * combines the block number and the offset number. The lowest 11 bits represent
+ * the offset number, and the next 32 bits are the block number. That is, only 43
+ * bits are used:
+ *
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys help keep the radix tree shallow.
+ *
+ * XXX: If we want to support other table AMs that want to use the full range
+ * of possible offset numbers, we'll need to change this.
+ *
+ * The 64-bit value is a bitmap representation of the lowest 6 bits, and
+ * the remaining 37 bits are used as the key:
+ *
+ * value = bitmap representation of XXXXXX
+ * key = XXXXXYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYuu
+ *
+ * The maximum height of the radix tree is 5.
+ *
+ * XXX: if we want to support non-heap table AM, we need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.
+ */
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6
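+
+/*
+ * Worked example (for illustration only): the Tid (block 1000, offset 5)
+ * gives tid_i = 5 | (1000 << 11) = 2048005. The low TIDSTORE_VALUE_NBITS
+ * bits select the bit position within the value, so off = 2048005 % 64 = 5
+ * and the value has bit 5 set. The remaining bits form the key,
+ * key = 2048005 >> 6 = 32000, and KEY_GET_BLKNO(32000) = 32000 >> (11 - 6)
+ * = 1000 recovers the block number.
+ */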
+
+/*
+ * Memory consumption depends on the number of Tids stored, but also on the
+ * distribution of them and how the radix tree stores them. The maximum bytes
+ * that a TidStore can use is specified by the max_bytes in tidstore_create().
+ *
+ * In non-shared cases, the radix tree uses a slab allocator for each kind of
+ * node class. The most memory consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is that we allocate the
+ * largest radix tree node in a new slab block, which is approximately 70kB.
+ * Therefore, we deduct 70kB from the maximum bytes.
+ *
+ * In shared cases, DSA allocates the memory segments to bit enough to follow
+ * a geometric series that approximately doubles the total DSA size. So we
+ * limit the maximum bytes for a TidStore to 75%. The 75% threshold perfectly
+ * works in case where the maximum bytes is power-of-2. In other cases, we
+ * use 60& threshold.
+ */
+#define TIDSTORE_MEMORY_DEDUCT_BYTES (1024L * 70) /* 70kB */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
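+/*
+ * Instantiate the radix tree template twice: a backend-local tree whose
+ * functions are prefixed with local_rt_, and a DSA-backed tree usable by
+ * multiple processes (RT_SHMEM), prefixed with shared_rt_. Which one a
+ * TidStore uses is determined by the 'area' argument of tidstore_create().
+ */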
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ /*
+ * 'num_tids' is the number of Tids stored so far. 'max_bytes' is the maximum
+ * bytes a TidStore can use. These two fields are used in both the
+ * non-shared and shared cases.
+ */
+ uint32 num_tids;
+ uint64 max_bytes;
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * We calculate the maximum bytes for the TidStore in different ways
+ * for the non-shared and shared cases. Please refer to the comment above
+ * TIDSTORE_MEMORY_DEDUCT_BYTES for details.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - TIDSTORE_MEMORY_DEDUCT_BYTES;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backends must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this TidStore.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ /*
+ * Free the current radix tree, and return allocated DSM segments
+ * to the operating system, if necessary.
+ */
+ if (TidStoreIsShared(ts))
+ {
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, val);
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/*
+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are sorted in ascending order.
+ */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
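+ /*
+ * Offsets on the same block that differ only in the low
+ * TIDSTORE_VALUE_NBITS bits map to the same key, so we accumulate their
+ * bits in 'val' and insert one key-value pair each time the key changes.
+ */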
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ /* insert the key-value */
+ tidstore_insert_kv(ts, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ /* insert the key-value */
+ tidstore_insert_kv(ts, last_key, val);
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+}
+
+/* Return true if the given Tid is present in TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = TidStoreIsShared(ts) ?
+ shared_rt_search(ts->tree.shared, key, &val) :
+ local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+ iter->result.blkno = InvalidBlockNumber;
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate */
+ if (ts->control->num_tids == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+uint64
+tidstore_num_tids(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+uint64
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return (uint64) sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return (uint64) sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a Tid into a key and the bit position ('off') within the 64-bit value.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..4bffdf0920
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually only partially used */
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern uint64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern uint64 tidstore_max_memory(TidStore *ts);
+extern uint64 tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..1973963440
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..3365b073a4
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.0
Attachment: v18-0008-Turn-branch-into-Assert-in-RT_NODE_UPDATE_INNER.patch
From 009c01a67817389fc5972d848334c1da00e8864c Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 15 Jan 2023 14:31:42 +0700
Subject: [PATCH v18 08/10] Turn branch into Assert in RT_NODE_UPDATE_INNER
---
src/include/lib/radixtree.h | 9 ++----
src/include/lib/radixtree_search_impl.h | 41 ++++++++++++++-----------
2 files changed, 25 insertions(+), 25 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index e053a2e56e..9f8bed09f7 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1129,7 +1129,7 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
#endif
}
-static inline bool
+static inline void
RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
{
#define RT_ACTION_UPDATE
@@ -1160,12 +1160,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child
tree->ctl->root = new_child;
}
else
- {
- bool replaced PG_USED_FOR_ASSERTS_ONLY;
-
- replaced = RT_NODE_UPDATE_INNER(parent, key, new_child);
- Assert(replaced);
- }
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
RT_FREE_NODE(tree, old_child);
}
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 3e97c31c2c..31e4978e4f 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -32,18 +32,19 @@
RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n4->children[idx] = new_child;
+#else
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
value = n4->values[idx];
-#else
-#ifdef RT_ACTION_UPDATE
- n4->children[idx] = new_child;
#else
child = n4->children[idx];
#endif
-#endif
+#endif /* RT_ACTION_UPDATE */
break;
}
case RT_NODE_KIND_32:
@@ -51,18 +52,19 @@
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
value = n32->values[idx];
-#else
-#ifdef RT_ACTION_UPDATE
- n32->children[idx] = new_child;
#else
child = n32->children[idx];
#endif
-#endif
+#endif /* RT_ACTION_UPDATE */
break;
}
case RT_NODE_KIND_125:
@@ -70,24 +72,28 @@
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
int slotpos = n125->base.slot_idxs[chunk];
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_NODE_125_INVALID_IDX);
+ n125->children[slotpos] = new_child;
+#else
if (slotpos == RT_NODE_125_INVALID_IDX)
return false;
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
-#else
-#ifdef RT_ACTION_UPDATE
- n125->children[slotpos] = new_child;
#else
child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
#endif
-#endif
+#endif /* RT_ACTION_UPDATE */
break;
}
case RT_NODE_KIND_256:
{
RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
#ifdef RT_NODE_LEVEL_LEAF
if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
#else
@@ -97,28 +103,27 @@
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
-#else
-#ifdef RT_ACTION_UPDATE
- RT_NODE_INNER_256_SET(n256, chunk, new_child);
#else
child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
#endif
-#endif
+#endif /* RT_ACTION_UPDATE */
break;
}
}
-#ifndef RT_ACTION_UPDATE
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
#ifdef RT_NODE_LEVEL_LEAF
Assert(value_p != NULL);
*value_p = value;
#else
Assert(child_p != NULL);
*child_p = child;
-#endif
#endif
return true;
+#endif /* RT_ACTION_UPDATE */
#undef RT_NODE4_TYPE
#undef RT_NODE32_TYPE
--
2.39.0
Attachment: v18-0007-Implement-shared-memory.patch
From ba09c9cb0b6abd31454ef286b8012f1e4d968d8b Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 9 Jan 2023 14:32:39 +0700
Subject: [PATCH v18 07/10] Implement shared memory
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 434 ++++++++++++++----
src/include/lib/radixtree_delete_impl.h | 6 +
src/include/lib/radixtree_insert_impl.h | 43 +-
src/include/lib/radixtree_iter_impl.h | 23 +-
src/include/lib/radixtree_search_impl.h | 28 +-
src/include/utils/dsa.h | 1 +
.../modules/test_radixtree/test_radixtree.c | 43 ++
8 files changed, 469 insertions(+), 121 deletions(-)
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 604b702a91..50f0aae3ab 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index a02e835cd6..e053a2e56e 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -42,6 +42,8 @@
* - RT_DEFINE - if defined function definitions are generated
* - RT_SCOPE - in which scope (e.g. extern, static inline) do function
* declarations reside
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
*
* Optional parameters:
* - RT_DEBUG - if defined add stats tracking and debugging functions
@@ -51,6 +53,9 @@
*
* RT_CREATE - Create a new, empty radix tree
* RT_FREE - Free the radix tree
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
* RT_SEARCH - Search a key-value pair
* RT_SET - Set a key-value pair
* RT_DELETE - Delete a key-value pair
@@ -80,7 +85,8 @@
#include "miscadmin.h"
#include "nodes/bitmapset.h"
#include "port/pg_bitutils.h"
-#include "port/pg_lfind.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
#include "utils/memutils.h"
/* helpers */
@@ -92,6 +98,11 @@
#define RT_CREATE RT_MAKE_NAME(create)
#define RT_FREE RT_MAKE_NAME(free)
#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
#define RT_SET RT_MAKE_NAME(set)
#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
@@ -110,9 +121,11 @@
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
-#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
@@ -138,6 +151,7 @@
#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
@@ -150,7 +164,11 @@
/* type declarations */
#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
#define RT_NODE RT_MAKE_NAME(node)
#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
@@ -181,8 +199,20 @@
typedef struct RT_RADIX_TREE RT_RADIX_TREE;
typedef struct RT_ITER RT_ITER;
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
@@ -306,9 +336,21 @@ typedef struct RT_NODE
uint8 kind;
} RT_NODE;
+
#define RT_PTR_LOCAL RT_NODE *
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
@@ -516,22 +558,43 @@ static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
[RT_NODE_KIND_256] = RT_CLASS_256,
};
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
/* A radix tree with nodes */
-typedef struct RT_RADIX_TREE
+typedef struct RT_RADIX_TREE_CONTROL
{
- MemoryContext context;
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
RT_PTR_ALLOC root;
uint64 max_val;
uint64 num_keys;
- MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
- MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
-
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* A radix tree with nodes */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
} RT_RADIX_TREE;
/*
@@ -547,6 +610,11 @@ typedef struct RT_RADIX_TREE
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning the iteration
+ * while one process is doing it, or to allow multiple processes to iterate concurrently.
*/
typedef struct RT_NODE_ITER
{
@@ -567,14 +635,35 @@ typedef struct RT_ITER
} RT_ITER;
-static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
- uint64 key, RT_PTR_LOCAL child);
-static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
uint64 key, uint64 value);
/* verification (available only with assertion) */
static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
/*
* Return index of the first element in 'base' that equals 'key'. Return -1
* if there is no such element.
@@ -806,7 +895,7 @@ static inline bool
RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
- return (node->children[chunk] != NULL);
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
}
static inline bool
@@ -860,7 +949,7 @@ static inline void
RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
{
Assert(!NODE_IS_LEAF(node));
- node->children[chunk] = NULL;
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
}
static inline void
@@ -902,21 +991,31 @@ RT_SHIFT_GET_MAX_VAL(int shift)
static RT_PTR_ALLOC
RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
{
- RT_PTR_ALLOC newnode;
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
if (inner)
- newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
- RT_SIZE_CLASS_INFO[size_class].inner_size);
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
else
- newnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
- RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (inner)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+#endif
#ifdef RT_DEBUG
/* update the statistics */
- tree->cnt[size_class]++;
+ tree->ctl->cnt[size_class]++;
#endif
- return newnode;
+ return allocnode;
}
/* Initialize the node contents */
@@ -954,13 +1053,15 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
{
int shift = RT_KEY_GET_SHIFT(key);
bool inner = shift > 0;
- RT_PTR_ALLOC newnode;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
- newnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newnode->shift = shift;
- tree->max_val = RT_SHIFT_GET_MAX_VAL(shift);
- tree->root = newnode;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
}
static inline void
@@ -969,7 +1070,7 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
newnode->shift = oldnode->shift;
newnode->count = oldnode->count;
}
-
+#if 0
/*
* Create a new node with 'new_kind' and the same shift, chunk, and
* count of 'node'.
@@ -977,30 +1078,33 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
static RT_NODE*
RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
{
- RT_PTR_ALLOC newnode;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
bool inner = !NODE_IS_LEAF(node);
- newnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
RT_COPY_NODE(newnode, node);
return newnode;
}
-
+#endif
/* Free the given node */
static void
-RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
{
/* If we're deleting the root node, make the tree empty */
- if (tree->root == node)
+ if (tree->ctl->root == allocnode)
{
- tree->root = NULL;
- tree->max_val = 0;
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
}
#ifdef RT_DEBUG
{
int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
/* update the statistics */
for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
@@ -1013,12 +1117,26 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
if (i == RT_SIZE_CLASS_COUNT)
i = RT_CLASS_256;
- tree->cnt[i]--;
- Assert(tree->cnt[i] >= 0);
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
}
#endif
- pfree(node);
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+static inline bool
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
}
/*
@@ -1028,18 +1146,24 @@ static void
RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
RT_PTR_ALLOC new_child, uint64 key)
{
- Assert(old_child->shift == new_child->shift);
+ RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
+
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old->shift == new->shift);
+#endif
- if (parent == old_child)
+ if (parent == old)
{
/* Replace the root node with the new large node */
- tree->root = new_child;
+ tree->ctl->root = new_child;
}
else
{
- bool replaced PG_USED_FOR_ASSERTS_ONLY;
+ bool replaced PG_USED_FOR_ASSERTS_ONLY;
- replaced = RT_NODE_INSERT_INNER(tree, NULL, parent, key, new_child);
+ replaced = RT_NODE_UPDATE_INNER(parent, key, new_child);
Assert(replaced);
}
@@ -1054,7 +1178,8 @@ static void
RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
- int shift = tree->root->shift + RT_NODE_SPAN;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
target_shift = RT_KEY_GET_SHIFT(key);
@@ -1066,14 +1191,14 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
RT_NODE_INNER_4 *n4;
allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
- node = (RT_PTR_LOCAL) allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
node->shift = shift;
node->count = 1;
n4 = (RT_NODE_INNER_4 *) node;
n4->base.chunks[0] = 0;
- n4->children[0] = tree->root;
+ n4->children[0] = tree->ctl->root;
/* Update the root */
tree->ctl->root = allocnode;
@@ -1081,7 +1206,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
shift += RT_NODE_SPAN;
}
- tree->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
}
/*
@@ -1090,10 +1215,12 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
*/
static inline void
RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
- RT_PTR_LOCAL node)
+ RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
{
int shift = node->shift;
+ Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+
while (shift >= RT_NODE_SPAN)
{
RT_PTR_ALLOC allocchild;
@@ -1102,18 +1229,19 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent
bool inner = newshift > 0;
allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
- newchild = (RT_PTR_LOCAL) allocchild;
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
parent = node;
node = newchild;
+ nodep = allocchild;
shift -= RT_NODE_SPAN;
}
- RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
- tree->num_keys++;
+ RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ tree->ctl->num_keys++;
}
/*
@@ -1172,8 +1300,8 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
/* Insert the child to the inner node */
static bool
-RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node, uint64 key,
- RT_PTR_ALLOC child)
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_insert_impl.h"
@@ -1182,7 +1310,7 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node
/* Insert the value to the leaf node */
static bool
-RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
uint64 key, uint64 value)
{
#define RT_NODE_LEVEL_LEAF
@@ -1194,18 +1322,31 @@ RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_LOCAL node,
* Create the radix tree in the given memory context and return it.
*/
RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
RT_CREATE(MemoryContext ctx)
+#endif
{
RT_RADIX_TREE *tree;
MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
old_ctx = MemoryContextSwitchTo(ctx);
- tree = palloc(sizeof(RT_RADIX_TREE));
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
tree->context = ctx;
- tree->root = NULL;
- tree->max_val = 0;
- tree->num_keys = 0;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
/* Create the slab allocator for each size class */
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
@@ -1218,27 +1359,78 @@ RT_CREATE(MemoryContext ctx)
RT_SIZE_CLASS_INFO[i].name,
RT_SIZE_CLASS_INFO[i].leaf_blocksize,
RT_SIZE_CLASS_INFO[i].leaf_size);
-#ifdef RT_DEBUG
- tree->cnt[i] = 0;
-#endif
}
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
MemoryContextSwitchTo(old_ctx);
return tree;
}
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* XXX: do we need to set a callback on exit to detach dsa? */
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+#endif
+
/*
* Free the given radix tree.
*/
RT_SCOPE void
RT_FREE(RT_RADIX_TREE *tree)
{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);
+#else
+ pfree(tree->ctl);
+
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
MemoryContextDelete(tree->inner_slabs[i]);
MemoryContextDelete(tree->leaf_slabs[i]);
}
+#endif
pfree(tree);
}
@@ -1252,46 +1444,54 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
{
int shift;
bool updated;
- RT_PTR_LOCAL node;
RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC nodep;
+ RT_PTR_LOCAL node;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
/* Empty tree, create the root */
- if (!tree->root)
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
RT_NEW_ROOT(tree, key);
/* Extend the tree if necessary */
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
RT_EXTEND(tree, key);
- Assert(tree->root);
+ //Assert(tree->ctl->root);
- shift = tree->root->shift;
- node = parent = tree->root;
+ nodep = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, nodep);
+ shift = parent->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- RT_PTR_LOCAL child;
+ RT_PTR_ALLOC child;
+
+ node = RT_PTR_GET_LOCAL(tree, nodep);
if (NODE_IS_LEAF(node))
break;
if (!RT_NODE_SEARCH_INNER(node, key, &child))
{
- RT_SET_EXTEND(tree, key, value, parent, node);
+ RT_SET_EXTEND(tree, key, value, parent, nodep, node);
return false;
}
parent = node;
- node = child;
+ nodep = child;
shift -= RT_NODE_SPAN;
}
- updated = RT_NODE_INSERT_LEAF(tree, parent, node, key, value);
+ updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
/* Update the statistics */
if (!updated)
- tree->num_keys++;
+ tree->ctl->num_keys++;
return updated;
}
@@ -1307,13 +1507,16 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
RT_PTR_LOCAL node;
int shift;
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
Assert(value_p != NULL);
- if (!tree->root || key > tree->max_val)
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
return false;
- node = tree->root;
- shift = tree->root->shift;
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
@@ -1326,7 +1529,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
if (!RT_NODE_SEARCH_INNER(node, key, &child))
return false;
- node = child;
+ node = RT_PTR_GET_LOCAL(tree, child);
shift -= RT_NODE_SPAN;
}
@@ -1341,37 +1544,44 @@ RT_SCOPE bool
RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
{
RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
int shift;
int level;
bool deleted;
- if (!tree->root || key > tree->max_val)
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
return false;
/*
* Descend the tree to search the key while building a stack of nodes we
* visited.
*/
- node = tree->root;
- shift = tree->root->shift;
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
level = -1;
while (shift > 0)
{
RT_PTR_ALLOC child;
/* Push the current node to the stack */
- stack[++level] = node;
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
if (!RT_NODE_SEARCH_INNER(node, key, &child))
return false;
- node = child;
+ allocnode = child;
shift -= RT_NODE_SPAN;
}
/* Delete the key from the leaf node if exists */
- Assert(NODE_IS_LEAF(node));
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
deleted = RT_NODE_DELETE_LEAF(node, key);
if (!deleted)
@@ -1381,7 +1591,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
}
/* Found the key to delete. Update the statistics */
- tree->num_keys--;
+ tree->ctl->num_keys--;
/*
* Return if the leaf node still has keys and we don't need to delete the
@@ -1391,13 +1601,14 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
return true;
/* Free the empty leaf node */
- RT_FREE_NODE(tree, node);
+ RT_FREE_NODE(tree, allocnode);
/* Delete the key in inner nodes recursively */
while (level >= 0)
{
- node = stack[level--];
+ allocnode = stack[level--];
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
deleted = RT_NODE_DELETE_INNER(node, key);
Assert(deleted);
@@ -1406,7 +1617,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
break;
/* The node became empty */
- RT_FREE_NODE(tree, node);
+ RT_FREE_NODE(tree, allocnode);
}
return true;
@@ -1478,6 +1689,7 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
{
MemoryContext old_ctx;
RT_ITER *iter;
+ RT_PTR_LOCAL root;
int top_level;
old_ctx = MemoryContextSwitchTo(tree->context);
@@ -1486,17 +1698,18 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
iter->tree = tree;
/* empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return iter;
- top_level = iter->tree->root->shift / RT_NODE_SPAN;
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
iter->stack_len = top_level;
/*
* Descend to the left most leaf node from the root. The key is being
* constructed while descending to the leaf.
*/
- RT_UPDATE_ITER_STACK(iter, iter->tree->root, top_level);
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
MemoryContextSwitchTo(old_ctx);
@@ -1511,7 +1724,7 @@ RT_SCOPE bool
RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
{
/* Empty tree */
- if (!iter->tree->root)
+ if (!iter->tree->ctl->root)
return false;
for (;;)
@@ -1571,7 +1784,7 @@ RT_END_ITERATE(RT_ITER *iter)
RT_SCOPE uint64
RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
{
- return tree->num_keys;
+ return tree->ctl->num_keys;
}
/*
@@ -1580,13 +1793,19 @@ RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
RT_SCOPE uint64
RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
+ // XXX is this necessary?
Size total = sizeof(RT_RADIX_TREE);
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
}
+#endif
return total;
}
@@ -1670,13 +1889,13 @@ void
rt_stats(RT_RADIX_TREE *tree)
{
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->num_keys,
- tree->root->shift / RT_NODE_SPAN,
- tree->cnt[RT_CLASS_4_FULL],
- tree->cnt[RT_CLASS_32_PARTIAL],
- tree->cnt[RT_CLASS_32_FULL],
- tree->cnt[RT_CLASS_125_FULL],
- tree->cnt[RT_CLASS_256])));
+ tree->ctl->num_keys,
+ tree->ctl->root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
}
static void
@@ -1848,23 +2067,23 @@ rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
elog(NOTICE, "-----------------------------------------------------------");
elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
- tree->max_val, tree->max_val);
+ tree->ctl->max_val, tree->ctl->max_val);
- if (!tree->root)
+ if (!tree->ctl->root)
{
elog(NOTICE, "tree is empty");
return;
}
- if (key > tree->max_val)
+ if (key > tree->ctl->max_val)
{
elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
- node = tree->root;
- shift = tree->root->shift;
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
while (shift >= 0)
{
RT_PTR_LOCAL child;
@@ -1901,15 +2120,15 @@ rt_dump(RT_RADIX_TREE *tree)
RT_SIZE_CLASS_INFO[i].inner_blocksize,
RT_SIZE_CLASS_INFO[i].leaf_size,
RT_SIZE_CLASS_INFO[i].leaf_blocksize);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->max_val);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
- if (!tree->root)
+ if (!tree->ctl->root)
{
fprintf(stderr, "empty tree\n");
return;
}
- rt_dump_node(tree->root, 0, true);
+ rt_dump_node(tree->ctl->root, 0, true);
}
#endif
@@ -1928,9 +2147,14 @@ rt_dump(RT_RADIX_TREE *tree)
#undef VAR_NODE_HAS_FREE_SLOT
#undef FIXED_NODE_HAS_FREE_SLOT
#undef RT_SIZE_CLASS_COUNT
+#undef RT_RADIX_TREE_MAGIC
/* type declarations */
#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
#undef RT_ITER
#undef RT_NODE
#undef RT_NODE_ITER
@@ -1959,6 +2183,9 @@ rt_dump(RT_RADIX_TREE *tree)
/* function declarations */
#undef RT_CREATE
#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
#undef RT_SET
#undef RT_BEGIN_ITERATE
#undef RT_ITERATE_NEXT
@@ -1980,6 +2207,8 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_GROW_NODE_KIND
#undef RT_COPY_NODE
#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
#undef RT_NODE_4_SEARCH_EQ
#undef RT_NODE_32_SEARCH_EQ
#undef RT_NODE_4_GET_INSERTPOS
@@ -2005,6 +2234,7 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_SHIFT_GET_MAX_VAL
#undef RT_NODE_SEARCH_INNER
#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
#undef RT_NODE_DELETE_INNER
#undef RT_NODE_DELETE_LEAF
#undef RT_NODE_INSERT_INNER
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index 6eefc63e19..eb87866b90 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -16,6 +16,12 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
switch (node->kind)
{
case RT_NODE_KIND_4:
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index ff76583402..e4faf54d9d 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -14,11 +14,14 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool chunk_exists = false;
- RT_NODE *newnode = NULL;
+ RT_PTR_LOCAL newnode = NULL;
+ RT_PTR_ALLOC allocnode;
#ifdef RT_NODE_LEVEL_LEAF
+ const bool inner = false;
Assert(NODE_IS_LEAF(node));
#else
+ const bool inner = true;
Assert(!NODE_IS_LEAF(node));
#endif
@@ -45,9 +48,15 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
/* grow node from 4 to 32 */
- newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
new32 = (RT_NODE32_TYPE *) newnode;
#ifdef RT_NODE_LEVEL_LEAF
RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
@@ -57,7 +66,7 @@
new32->base.chunks, new32->children);
#endif
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
}
else
@@ -112,17 +121,19 @@
n32->base.n.fanout == class32_min.fanout)
{
/* grow to the next size class of this kind */
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
#ifdef RT_NODE_LEVEL_LEAF
- newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, false);
memcpy(newnode, node, class32_min.leaf_size);
#else
- newnode = RT_ALLOC_NODE(tree, RT_CLASS_32_FULL, true);
memcpy(newnode, node, class32_min.inner_size);
#endif
newnode->fanout = class32_max.fanout;
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
/* also update pointer for this kind */
@@ -132,11 +143,17 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
Assert(n32->base.n.fanout == class32_max.fanout);
/* grow node from 32 to 125 */
- newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
new125 = (RT_NODE125_TYPE *) newnode;
for (int i = 0; i < class32_max.fanout; i++)
@@ -153,7 +170,7 @@
new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
}
else
@@ -204,9 +221,15 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
/* grow node from 125 to 256 */
- newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
new256 = (RT_NODE256_TYPE *) newnode;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
@@ -221,7 +244,7 @@
}
Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, node, newnode, key);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
node = newnode;
}
else
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index a153011376..0b8b68df6c 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -12,13 +12,22 @@
#error node level must be either inner or leaf
#endif
+ bool found = false;
+ uint8 key_chunk;
+
#ifdef RT_NODE_LEVEL_LEAF
uint64 value;
+
+ Assert(NODE_IS_LEAF(node_iter->node));
#else
- RT_NODE *child = NULL;
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
- bool found = false;
- uint8 key_chunk;
switch (node_iter->node->kind)
{
@@ -32,7 +41,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n4->values[node_iter->current_idx];
#else
- child = n4->children[node_iter->current_idx];
+ child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
#endif
key_chunk = n4->base.chunks[node_iter->current_idx];
found = true;
@@ -49,7 +58,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n32->values[node_iter->current_idx];
#else
- child = n32->children[node_iter->current_idx];
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
#endif
key_chunk = n32->base.chunks[node_iter->current_idx];
found = true;
@@ -73,7 +82,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
#else
- child = RT_NODE_INNER_125_GET_CHILD(n125, i);
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
#endif
key_chunk = i;
found = true;
@@ -101,7 +110,7 @@
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
#else
- child = RT_NODE_INNER_256_GET_CHILD(n256, i);
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
#endif
key_chunk = i;
found = true;
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index cbc357dcc8..3e97c31c2c 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -16,8 +16,13 @@
#ifdef RT_NODE_LEVEL_LEAF
uint64 value = 0;
+
+ Assert(NODE_IS_LEAF(node));
#else
- RT_PTR_LOCAL child = NULL;
+#ifndef RT_ACTION_UPDATE
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+#endif
+ Assert(!NODE_IS_LEAF(node));
#endif
switch (node->kind)
@@ -32,8 +37,12 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n4->values[idx];
+#else
+#ifdef RT_ACTION_UPDATE
+ n4->children[idx] = new_child;
#else
child = n4->children[idx];
+#endif
#endif
break;
}
@@ -47,22 +56,31 @@
#ifdef RT_NODE_LEVEL_LEAF
value = n32->values[idx];
+#else
+#ifdef RT_ACTION_UPDATE
+ n32->children[idx] = new_child;
#else
child = n32->children[idx];
+#endif
#endif
break;
}
case RT_NODE_KIND_125:
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
- if (!RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
+ if (slotpos == RT_NODE_125_INVALID_IDX)
return false;
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+#ifdef RT_ACTION_UPDATE
+ n125->children[slotpos] = new_child;
#else
child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
#endif
break;
}
@@ -79,19 +97,25 @@
#ifdef RT_NODE_LEVEL_LEAF
value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
#else
child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
#endif
break;
}
}
+#ifndef RT_ACTION_UPDATE
#ifdef RT_NODE_LEVEL_LEAF
Assert(value_p != NULL);
*value_p = value;
#else
Assert(child_p != NULL);
*child_p = child;
+#endif
#endif
return true;
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 104386e674..c67f936880 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 2256d08100..61d842789d 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -18,6 +18,7 @@
#include "nodes/bitmapset.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
#include "utils/timestamp.h"
@@ -103,6 +104,8 @@ static const test_spec test_specs[] = {
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
#include "lib/radixtree.h"
@@ -119,7 +122,15 @@ test_empty(void)
uint64 key;
uint64 val;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
radixtree = rt_create(CurrentMemoryContext);
+#endif
if (rt_search(radixtree, 0, &dummy))
elog(ERROR, "rt_search on empty tree returned true");
@@ -153,10 +164,20 @@ test_basic(int children, bool test_inner)
uint64 *keys;
int shift = test_inner ? 8 : 0;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
elog(NOTICE, "testing basic operations with %s node %d",
test_inner ? "inner" : "leaf", children);
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
radixtree = rt_create(CurrentMemoryContext);
+#endif
/* prepare keys in order like 1, 32, 2, 31, 2, ... */
keys = palloc(sizeof(uint64) * children);
@@ -297,9 +318,19 @@ test_node_types(uint8 shift)
{
rt_radix_tree *radixtree;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
radixtree = rt_create(CurrentMemoryContext);
+#endif
/*
* Insert and search entries for every node type at the 'shift' level,
@@ -332,6 +363,11 @@ test_pattern(const test_spec * spec)
int patternlen;
uint64 *pattern_values;
uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
if (rt_test_stats)
@@ -357,7 +393,13 @@ test_pattern(const test_spec * spec)
"radixtree test",
ALLOCSET_SMALL_SIZES);
MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
radixtree = rt_create(radixtree_ctx);
+#endif
+
/*
* Add values to the set.
@@ -563,6 +605,7 @@ test_pattern(const test_spec * spec)
elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
nafter, (nbefore - ndeleted), ndeleted);
+ rt_free(radixtree);
MemoryContextDelete(radixtree_ctx);
}
--
2.39.0
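As a reading aid for the RT_SHMEM parts above, the intended flow is roughly the
following sketch, pieced together from the new RT_ATTACH/RT_DETACH/RT_GET_HANDLE
declarations and the test module changes. The rt_* names assume the template is
instantiated with RT_SHMEM and the same RT_PREFIX as the test module; in a real worker
the dsa_area would come from dsa_attach() on the area the leader created.

/* Sketch: the leader creates the shared tree in a DSA area and publishes its handle. */
static rt_handle
leader_create_shared_tree(dsa_area *dsa)
{
	rt_radix_tree *tree = rt_create(CurrentMemoryContext, dsa);

	rt_set(tree, UINT64CONST(42), UINT64CONST(0xbeef));
	return rt_get_handle(tree);	/* hand this to workers, e.g. via shm_toc */
}

/* Sketch: a worker attaches to the existing tree and looks a key up. */
static void
worker_use_shared_tree(dsa_area *dsa, rt_handle handle)
{
	rt_radix_tree *tree = rt_attach(dsa, handle);
	uint64		val;

	if (rt_search(tree, UINT64CONST(42), &val))
		elog(NOTICE, "found value " UINT64_FORMAT, val);
	rt_detach(tree);
}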
Attachment: v18-0010-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
From b2d17b4649e1a5f1de5d8f598ae5c1a5c220d85e Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 13 Jan 2023 15:38:59 +0700
Subject: [PATCH v18 10/10] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which is neither space efficient nor fast to look up, and which
was limited to 1GB in size.

This changes lazy vacuum to use TIDStore for this purpose. Since
TIDStore, backed by the radix tree, allocates memory incrementally,
we get rid of the 1GB limit.

Also, since we can no longer estimate in advance exactly how many TIDs
fit in a given amount of memory, the progress columns max_dead_tuples
and num_dead_tuples are renamed and now report the progress
information in bytes.

Furthermore, since TIDStore uses the radix tree internally, the
minimum amount of memory it requires is 1MB, the initial DSA segment
size. Because of that, this change increases the minimum
maintenance_work_mem from 1MB to 2MB.
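To restate the central change (a simplified view of the lazy_scan_prune hunk further
down): instead of copying each dead offset into the flat dead_items array one at a
time, we now hand a whole block's dead offsets to the TidStore in a single call.

/* Before: copy each dead item pointer into the fixed-size array */
ItemPointerSetBlockNumber(&tmp, blkno);
for (int i = 0; i < lpdead_items; i++)
{
	ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
	dead_items->items[dead_items->num_items++] = tmp;
}

/* After: record all of this block's dead offsets at once */
tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);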
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 168 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +---------
src/backend/commands/vacuumparallel.c | 64 +++++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
15 files changed, 122 insertions(+), 242 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 358d2ff90f..6ce7ea9e35 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6840,10 +6840,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6851,10 +6851,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3694515167..58e87c4528 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -259,8 +260,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer *vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer *vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +827,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +908,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1039,11 +1040,18 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1080,7 +1088,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1233,7 +1241,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1871,23 +1879,15 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
vacrel->lpdead_item_pages++;
prunestate->has_lpdead_items = true;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -2107,8 +2107,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2117,17 +2116,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2176,7 +2168,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2205,7 +2197,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2232,8 +2224,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2278,7 +2270,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2351,7 +2343,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2388,10 +2380,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2406,7 +2399,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2415,12 +2409,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, &vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2430,6 +2425,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2439,14 +2435,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2463,11 +2458,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer *vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer *vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2486,16 +2480,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2575,7 +2564,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3071,46 +3059,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3121,11 +3069,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3152,7 +3098,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3165,11 +3111,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 447c9b970f..133e03d728 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c4ed7efce3..7de4350cde 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -95,7 +95,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params);
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2298,16 +2297,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2338,18 +2337,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2360,60 +2347,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..4c0ce4b7e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 196bece0a3..ff75fae88a 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -186,6 +186,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5025e80f89..edee8a2b2b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2301,7 +2301,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..220d89fff7 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e4162db613..40dda03088 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -204,6 +204,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 542c2e098c..e678e6f79e 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -524,7 +524,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..0c49354f04 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT s.stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index 6cb9c926c0..a795d705d5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -256,7 +256,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.39.0
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Jan 12, 2023 at 9:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Jan 12, 2023 at 5:21 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

Okay, I'll squash the previous patch and work on cleaning up the internals. I'll keep the external APIs the same so that your work on vacuum integration can be easily rebased on top of that, and we can work independently.
There were some conflicts with HEAD, so to keep the CF bot busy, I've quickly put together v18. I still have a lot of cleanup work to do, but this is enough for now.
Thanks! cfbot complains about some warnings, but these are expected
(due to unused delete routines etc). But one reported error[1] might
be relevant to the 0002 patch?
[05:44:11.759] "link" /MACHINE:x64
/OUT:src/test/modules/test_radixtree/test_radixtree.dll
src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.res
src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj
"/nologo" "/release" "/nologo" "/DEBUG"
"/PDB:src/test\modules\test_radixtree\test_radixtree.pdb" "/DLL"
"/IMPLIB:src/test\modules\test_radixtree\test_radixtree.lib"
"/INCREMENTAL:NO" "/STACK:4194304" "/NOEXP" "/DEBUG:FASTLINK"
"/NOIMPLIB" "C:/cirrus/build/src/backend/postgres.exe.lib"
"wldap32.lib" "c:/openssl/1.1/lib/libssl.lib"
"c:/openssl/1.1/lib/libcrypto.lib" "ws2_32.lib" "kernel32.lib"
"user32.lib" "gdi32.lib" "winspool.lib" "shell32.lib" "ole32.lib"
"oleaut32.lib" "uuid.lib" "comdlg32.lib" "advapi32.lib"
[05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
external symbol pg_popcount64
[05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
fatal error LNK1120: 1 unresolved externals
0003 contains all v17 local-memory coding squashed together.
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, therefore there are duplication codes. While this sometimes makes the
+ * code maintenance tricky, this reduces branch prediction misses when judging
+ * whether the node is a inner node of a leaf node.
This comment seems to be out-of-date since we made it a template.
---
+#ifndef RT_COMMON
+#define RT_COMMON
What are we using this macro RT_COMMON for?
---
The following macros are defined but not undefined in radixtree.h:
RT_MAKE_PREFIX
RT_MAKE_NAME
RT_MAKE_NAME_
RT_SEARCH
UINT64_FORMAT_HEX
RT_NODE_SPAN
RT_NODE_MAX_SLOTS
RT_CHUNK_MASK
RT_MAX_SHIFT
RT_MAX_LEVEL
RT_NODE_125_INVALID_IDX
RT_GET_KEY_CHUNK
BM_IDX
BM_BIT
RT_NODE_KIND_4
RT_NODE_KIND_32
RT_NODE_KIND_125
RT_NODE_KIND_256
RT_NODE_KIND_COUNT
RT_PTR_LOCAL
RT_PTR_ALLOC
RT_INVALID_PTR_ALLOC
NODE_SLAB_BLOCK_SIZE
0004 perf test not updated but it doesn't build by default so it's fine for now
Okay.
0005 removes node.chunk as discussed, but does not change node4 fanout yet.
LGTM.
0006 is a small cleanup regarding setting node fanout.
LGTM.
0007 squashes my shared memory work with Masahiko's fixes from the addendum v17-0010.
+ /* XXX: do we need to set a callback on exit to detach dsa? */
In the current shared radix tree design, it's the caller's responsibility
to create (or attach to) a DSA area and pass it to RT_CREATE()
or RT_ATTACH(). That lets one DSA area be used not only for the radix
tree but also for other data, which is more flexible. So the caller needs
to detach from the DSA area somehow, and I think we don't need to set a
callback here for that.
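The vacuumparallel.c hunk earlier in this mail shows that pattern concretely. Condensed from the patch (TOC bookkeeping and error handling omitted), the leader and worker sides look roughly like this:

/* leader: create the DSA area, then the shared TidStore inside it */
dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
                                     LWTRANCHE_PARALLEL_VACUUM_DSA, pcxt->seg);
dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
shared->dead_items_handle = tidstore_get_handle(dead_items);

/* worker: attach to the same DSA area, then to the TidStore via the handle */
area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
dead_items_area = dsa_attach_in_place(area_space, seg);
dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);

/* each backend detaches from both when it is done */
tidstore_detach(dead_items);
dsa_detach(dead_items_area);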
---
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);
Similar to above, I think we should not detach from the DSA area here.
Given that the DSA area used by the radix tree could also be used by
other data, I think that in RT_FREE() we need to free each radix tree
node allocated in DSA. In lazy vacuum, we check the memory usage
instead of the number of TIDs and need to reset the TidStore after an
index scan. So it does RT_FREE() and dsa_trim() to return DSM segments
to the OS. I've implemented rt_free_recurse() for this purpose in the
v15 version patch.
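For illustration, such a recursive free would presumably look something like the sketch below; the node-access helper and field names are made up here, only dsa_free() is the real API:

static void
rt_free_recurse(rt_radix_tree *tree, dsa_pointer nodep, int shift)
{
	/* rt_pointer_to_local() is a hypothetical DSA-pointer-to-local helper */
	rt_node    *node = rt_pointer_to_local(tree, nodep);

	/* inner nodes: free all subtrees before freeing the node itself */
	if (shift > 0)
	{
		for (int i = 0; i < node->count; i++)
			rt_free_recurse(tree, node->children[i], shift - RT_NODE_SPAN);
	}

	dsa_free(tree->dsa, nodep);
}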
--
- Assert(tree->root);
+ //Assert(tree->ctl->root);
I think we don't need this assertion in the first place. We check it
at the beginning of the function.
---
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
I think we can move this change to 0003 patch.
0008 turns the existence checks in RT_NODE_UPDATE_INNER into Asserts, as discussed.
LGTM.
0009/0010 are just copies of Masahiko's v17 addendum v17-0011/12, but the latter rebased over recent variable renaming (it's possible I missed something, so worth checking).
I've implemented the idea of using union. Let me share WIP code for
discussion, I've attached three patches that can be applied on top of

Seems fine as far as the union goes. Let's go ahead with this, and make progress on locking etc.
+1
Overall, the TidStore implementation with the union idea doesn't look so
ugly to me. But I got many compiler warnings about unused radix tree
functions like:

tidstore.c:99:19: warning: 'shared_rt_delete' defined but not used [-Wunused-function]

I'm not sure there is a convenient way to suppress this warning, but
one idea is to have some macros to specify what operations are
enabled/declared.

That sounds like a good idea. It's also worth wondering if we even need RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that if necessary. It's also misnamed, since it's concerned with the number of keys. The vacuum case cares about the number of TIDs, and not the number of (encoded) keys. Even if we ever (say) changed the key to block number and value to Bitmapset, the number of keys might not be interesting.
Right. In fact, TidStore doesn't use RT_NUM_ENTRIES.
It sounds like we should at least make the delete functionality optional. (Side note on optional functions: if an implementation didn't care about iteration or its order, we could optimize insertion into linear nodes)
Agreed.
Since this is WIP, you may already have some polish in mind, so I won't go over the patches in detail, but I wanted to ask about a few things (numbers referring to v17 addendum, not v18):
0011
+ * 'num_tids' is the number of Tids stored so far. 'max_byte' is the maximum
+ * bytes a TidStore can use. These two fields are commonly used in both
+ * non-shared case and shared case.
+ */
+ uint32 num_tids;

uint32 is how we store the block number, so this is too small and will wrap around on overflow. int64 seems better.
Agreed, will fix.
+ * We calculate the maximum bytes for the TidStore in different ways
+ * for non-shared case and shared case. Please refer to the comment
+ * TIDSTORE_MEMORY_DEDUCT for details.
+ */

Maybe the #define and comment should be close to here.
Will fix.
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backend must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)

If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.
Will fix.
+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are ordered in ascending order.

Must? What happens otherwise?
It ends up missing TIDs by overwriting the same key with different
values. Is it better to have a bool argument, say need_sort, to sort
the given array if the caller wants?
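Either way the fix is cheap on the caller side. A minimal sketch of sorting before the call, assuming a hypothetical offset_cmp() comparator and otherwise the names already used in the vacuum patch:

static int
offset_cmp(const void *a, const void *b)
{
	OffsetNumber oa = *(const OffsetNumber *) a;
	OffsetNumber ob = *(const OffsetNumber *) b;

	return (oa > ob) - (oa < ob);
}

	/* make sure the offsets are ascending before handing them over */
	qsort(deadoffsets, lpdead_items, sizeof(OffsetNumber), offset_cmp);
	tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);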
+ uint64 last_key = PG_UINT64_MAX;
I'm having some difficulty understanding this sentinel and how it's used.
Will improve the logic.
@@ -1039,11 +1040,18 @@ lazy_scan_heap(LVRelState *vacrel)
 if (prunestate.has_lpdead_items)
 {
 Size freespace;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, &vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, &vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);
 /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);

This part only runs "if (vacrel->nindexes == 0)", so seems like unneeded complexity. It arises because lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune() could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.
I agree that we don't need complexity here. I'll try this idea.
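To make the shape of that idea concrete, the caller side might end up looking roughly like the following sketch; the deadoffsets/ndeadoffsets out-arguments to lazy_scan_prune() are hypothetical, while the other calls use the signatures already in the patch:

	OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
	int			ndeadoffsets = 0;

	/* hypothetical out-arguments collecting this page's LP_DEAD offsets */
	lazy_scan_prune(vacrel, buf, blkno, page, &prunestate,
					deadoffsets, &ndeadoffsets);

	if (vacrel->nindexes == 0)
	{
		/* one-pass vacuum: reap this page's LP_DEAD items right away */
		if (ndeadoffsets > 0)
			lazy_vacuum_heap_page(vacrel, blkno, deadoffsets, ndeadoffsets,
								  buf, &vmbuffer);
	}
	else if (ndeadoffsets > 0)
	{
		/* two-pass vacuum: remember them for index and heap vacuuming */
		tidstore_add_tids(vacrel->dead_items, blkno, deadoffsets, ndeadoffsets);
	}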
On Mon, Jan 16, 2023 at 9:53 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've written a simple script to simulate the DSA memory usage and the
limit. The 75% limit works fine for the power-of-two cases, and we can
use the 60% limit for the other cases (it seems we can use up to about
66%, but I used 60% for safety). It would be best if we could
mathematically prove it, but I could prove only the power-of-two cases.
Still, the script practically shows that the 60% threshold works for
these cases.

Okay. It's worth highlighting this in the comments, and also the fact that it depends on internal details of how DSA increases segment size.
Agreed.
Since it seems you're working on another cleanup, I can address the
above comments after your work is completed. But I'm also fine with
including them into your cleanup work.
Regards,
[1]: https://cirrus-ci.com/task/5078505327689728
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Thanks! cfbot complaints about some warnings but these are expected
(due to unused delete routines etc). But one reported error[1] might
be relevant with 0002 patch?
[05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
external symbol pg_popcount64
[05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
fatal error LNK1120: 1 unresolved externals
Yeah, I'm not sure what's causing that. Since that comes from a debugging
function, we could work around it, but it would be nice to understand why,
so I'll probably have to experiment on my CI repo.
---
+#ifndef RT_COMMON
+#define RT_COMMON

What are we using this macro RT_COMMON for?
It was a quick way to define some things only once, so they probably all
showed up in the list of things you found not undefined. It's different
from the style of simplehash.h, which is to have a local name and #undef
for every single thing. simplehash.h is a precedent, so I'll change it to
match. I'll take a look at your list, too.
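For reference, the intended usage pattern of the template (per the header comment in the v19-0003 patch below) is that the user #define's the parameters and then includes the header, which is expected to #undef its own symbols at the end; the simplehash.h convention would simply extend that cleanup to every internal name on your list. The parameter values here are only illustrative:

/* a local (non-shared) radix tree specialized for this translation unit */
#define RT_PREFIX local_rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#include "lib/radixtree.h"
/* at this point every RT_* macro should have been #undef'd by the header */

/* a shared-memory variant would additionally #define RT_SHMEM */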
+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are ordered in ascending order.
Must? What happens otherwise?
It ends up missing TIDs by overwriting the same key with different
values. Is it better to have a bool argument, say need_sort, to sort
the given array if the caller wants?
Since it seems you're working on another cleanup, I can address the
above comments after your work is completed. But I'm also fine with
including them into your cleanup work.
I think we can work mostly simultaneously, if you work on tid store and
vacuum, and I work on the template. We can always submit a full patchset
including each other's latest work. That will catch rebase issues sooner.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Attached is an update that mostly has the modest goal of getting CI green
again. v19-0003 has squashed the entire radix tree template from
previously. I've kept out the perf test module for now -- still needs
updating.
[05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
external symbol pg_popcount64
[05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
fatal error LNK1120: 1 unresolved externals

Yeah, I'm not sure what's causing that. Since that comes from a debugging
function, we could work around it, but it would be nice to understand why,
so I'll probably have to experiment on my CI repo.
I'm still confused by this error, because it only occurs in the test
module. I successfully built with just 0002 in CI, so elsewhere the bmw_*
symbols resolve just fine on all platforms. I've worked around the error in
v19-0004 by using the general-purpose pg_popcount() function. We only need
to count bits in assert builds, so it doesn't matter a whole lot.
+ /* XXX: do we need to set a callback on exit to detach dsa? */
In the current shared radix tree design, it's the caller's responsibility
to create (or attach to) a DSA area and pass it to RT_CREATE()
or RT_ATTACH(). That lets one DSA area be used not only for the radix
tree but also for other data, which is more flexible. So the caller needs
to detach from the DSA area somehow, and I think we don't need to set a
callback here for that.

---
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);

Similar to above, I think we should not detach from the DSA area here.
Given that the DSA area used by the radix tree could also be used by
other data, I think that in RT_FREE() we need to free each radix tree
node allocated in DSA. In lazy vacuum, we check the memory usage
instead of the number of TIDs and need to reset the TidStore after an
index scan. So it does RT_FREE() and dsa_trim() to return DSM segments
to the OS. I've implemented rt_free_recurse() for this purpose in the
v15 version patch.

--
- Assert(tree->root);
+ //Assert(tree->ctl->root);

I think we don't need this assertion in the first place. We check it
at the beginning of the function.
I've removed these in v19-0006.
That sounds like a good idea. It's also worth wondering if we even need
RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that
if necessary. It's also misnamed, since it's concerned with the number of
keys. The vacuum case cares about the number of TIDs, and not number of
(encoded) keys. Even if we ever (say) changed the key to blocknumber and
value to Bitmapset, the number of keys might not be interesting.
Right. In fact, TidStore doesn't use RT_NUM_ENTRIES.
I've moved it to the test module, which uses it extensively. There, it's
clearer what the name is for, so I didn't change it.
It sounds like we should at least make the delete functionality
optional. (Side note on optional functions: if an implementation didn't
care about iteration or its order, we could optimize insertion into linear
nodes)
Agreed.
Done in v19-0007.
v19-0009 is just a rebase over some more vacuum cleanups.
I'll continue working on internals cleanup.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v19-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From 2cff749da71a4e581e762aac7587ec6463a1dd3d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v19 1/9] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..84d41a340a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.0
v19-0005-Remove-RT_NUM_ENTRIES.patch
From d801347976bdc6489c66dcaf64dfed343bed39dc Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 17 Jan 2023 16:38:09 +0700
Subject: [PATCH v19 5/9] Remove RT_NUM_ENTRIES
This is not expected to be used everywhere, and is very simple
to implement, so move definition to test module where it is
used extensively.
---
src/include/lib/radixtree.h | 13 -------------
src/test/modules/test_radixtree/test_radixtree.c | 9 +++++++++
2 files changed, 9 insertions(+), 13 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 7f928f02d6..ba326562d5 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -63,7 +63,6 @@
* RT_ITERATE_NEXT - Return next key-value pair, if any
* RT_END_ITER - End iteration
* RT_MEMORY_USAGE - Get the memory usage
- * RT_NUM_ENTRIES - Get the number of key-value pairs
*
* RT_CREATE() creates an empty radix tree in the given memory context
* and memory contexts for all kinds of radix tree node under the memory context.
@@ -109,7 +108,6 @@
#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
#define RT_DELETE RT_MAKE_NAME(delete)
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
-#define RT_NUM_ENTRIES RT_MAKE_NAME(num_entries)
#define RT_DUMP RT_MAKE_NAME(dump)
#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
#define RT_STATS RT_MAKE_NAME(stats)
@@ -222,7 +220,6 @@ RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
-RT_SCOPE uint64 RT_NUM_ENTRIES(RT_RADIX_TREE *tree);
#ifdef RT_DEBUG
RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
@@ -1773,15 +1770,6 @@ RT_END_ITERATE(RT_ITER *iter)
pfree(iter);
}
-/*
- * Return the number of keys in the radix tree.
- */
-RT_SCOPE uint64
-RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
-{
- return tree->ctl->num_keys;
-}
-
/*
* Return the statistics of the amount of memory used by the radix tree.
*/
@@ -2185,7 +2173,6 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_END_ITERATE
#undef RT_DELETE
#undef RT_MEMORY_USAGE
-#undef RT_NUM_ENTRIES
#undef RT_DUMP
#undef RT_DUMP_SEARCH
#undef RT_STATS
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 61d842789d..076173f628 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -109,6 +109,15 @@ static const test_spec test_specs[] = {
#include "lib/radixtree.h"
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(test_radixtree);
--
2.39.0
v19-0004-Workaround-link-errors-on-Windows-CI.patch
From 1413044ac1546ea3c940c1bdaa69083bfa417f98 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 17 Jan 2023 15:45:39 +0700
Subject: [PATCH v19 4/9] Workaround link errors on Windows CI
For some reason, using bmw_popcount() here leads to
link errors, although bmw_rightmost_one_pos() works
fine.
---
src/include/lib/radixtree.h | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 9f8bed09f7..7f928f02d6 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1863,12 +1863,10 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
if (NODE_IS_LEAF(node))
{
RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
- int cnt = 0;
-
- for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
- cnt += bmw_popcount(n256->isset[i]);
+ int cnt;
/* Check if the number of used chunk matches */
+ cnt = pg_popcount((const char *) n256->isset, sizeof(n256->isset));
Assert(n256->base.n.count == cnt);
break;
--
2.39.0
v19-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 77251267b2c2a9123cdd7c2fe03907c45607cf7f Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v19 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bafec5f7..5bd3da4948 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.0
v19-0003-Add-radixtree-template.patch
From 3e74bae1c7a27bd9c91c16f433614c1a7563d6de Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v19 3/9] Add radixtree template
The only thing configurable at this point is function scope,
prefix, and local/shared memory.
The key and value type are still hard-coded to uint64.
To make this more useful, at least value type should be
configurable.
It might be good at some point to offer a different tree type,
e.g. "single-value leaves" to allow for variable length keys
and values, giving full flexibility to developers.
TODO: Reducing the smallest node to 3 members will
eliminate padding and only take up 32 bytes for
inner nodes.
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2243 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 106 +
src/include/lib/radixtree_insert_impl.h | 316 +++
src/include/lib/radixtree_iter_impl.h | 138 +
src/include/lib/radixtree_search_impl.h | 131 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 631 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 3715 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 604b702a91..50f0aae3ab 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..9f8bed09f7
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2243 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports a fixed key length, so we don't expect the tree to become
+ * very deep.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes (shift > 0)
+ * store a pointer to the child node as the value, while leaf nodes (shift == 0)
+ * store the 64-bit unsigned integer specified by the user as the value. The
+ * paper refers to this technique as "Multi-value leaves". We chose it to avoid
+ * an additional pointer traversal. It is the reason this code currently does
+ * not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is some code duplication. While this sometimes makes code
+ * maintenance tricky, it reduces branch prediction misses when judging
+ * whether the node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined, function prototypes and type declarations are
+ *	 generated
+ * - RT_DEFINE - if defined, function definitions are generated
+ * - RT_SCOPE - the scope (e.g. extern, static inline) in which function
+ *	 declarations reside
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ *
+ * Optional parameters:
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_DELETE - Delete a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE		- End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ * RT_NUM_ENTRIES - Get the number of key-value pairs
+ *
+ * RT_CREATE() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for the radix tree nodes.
+ *
+ * RT_ITERATE_NEXT() returns key-value pairs in ascending order of the key.
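+ *
+ * As a rough, illustrative sketch only (the 'foo' prefix and the calling code
+ * below are hypothetical, not part of this patch), a local-memory radix tree
+ * could be generated and used like this:
+ *
+ *	#define RT_PREFIX foo
+ *	#define RT_SCOPE static
+ *	#define RT_DECLARE
+ *	#define RT_DEFINE
+ *	#include "lib/radixtree.h"
+ *
+ *	foo_radix_tree *tree = foo_create(CurrentMemoryContext);
+ *	uint64		val;
+ *
+ *	foo_set(tree, 42, 123);
+ *	if (foo_search(tree, 42, &val))
+ *		Assert(val == 123);
+ *	foo_free(tree);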
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#define RT_DELETE RT_MAKE_NAME(delete)
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#define RT_NUM_ENTRIES RT_MAKE_NAME(num_entries)
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
+#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
+#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+RT_SCOPE uint64 RT_NUM_ENTRIES(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* macros and types common to all implementations */
+#ifndef RT_COMMON
+#define RT_COMMON
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* The maximum number of levels the radix tree can have */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
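+
+/*
+ * For example (illustrative values only), with RT_NODE_SPAN = 8,
+ * RT_GET_KEY_CHUNK(0x0102030405060708, 16) extracts the byte 0x06.
+ */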
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds, and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind share the same
+ * node structure but have different fanouts, stored in the 'fanout' field of
+ * RT_NODE. For example, in the size class with fanout 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+#endif /* RT_COMMON */
+
+
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Common type for all node types */
+typedef struct RT_NODE
+{
+ /*
+	 * Number of children. We use uint16 to be able to indicate up to 256
+	 * children, since the node span is 8 bits.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
+
+/*
+ * Base types of each node kind, for both leaf and inner nodes.
+ *
+ * The base types must be able to accommodate the largest size class for
+ * variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_4
+{
+ RT_NODE n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} RT_NODE_BASE_4;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+	/* The slot index for each chunk; RT_NODE_125_INVALID_IDX means unused */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(128)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * strong a reason. It might be better to just indicate non-existing entries
+ * the same way in inner nodes.
+ */
+typedef struct RT_NODE_INNER_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_4;
+
+typedef struct RT_NODE_LEAF_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_4;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array of RT_NODE_MAX_SLOTS
+ * entries for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} RT_SIZE_CLASS_ELEM;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
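+ *
+ * For example, assuming SLAB_DEFAULT_BLOCK_SIZE is the usual 8kB and taking an
+ * illustrative node size of 300 bytes, this yields
+ * Max((8192 / 300) * 300, 300 * 32) = Max(8100, 9600) = 9600 bytes.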
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+/* Map from the node kind to its minimum size class */
+static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Control data for a radix tree, stored in DSA when RT_SHMEM is defined */
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* A radix tree: per-backend state plus a pointer to the control data */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating over the radix tree returns each key-value pair in ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to iterate at a time. Therefore,
+ * RT_NODE_ITER holds local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning an
+ * iteration while one is in progress, or support for multiple concurrent iterations.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+} RT_ITER;
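+
+/*
+ * For illustration only (the 'foo' prefix and process() below are hypothetical
+ * placeholders), a typical iteration loop would look like:
+ *
+ *	foo_iter   *iter = foo_begin_iterate(tree);
+ *	uint64		key;
+ *	uint64		value;
+ *
+ *	while (foo_iterate_next(iter, &key, &value))
+ *		process(key, value);	(pairs come back in ascending key order)
+ *	foo_end_iterate(iter);
+ */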
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value);
+
+/* verification (used only with assertions enabled) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in the node that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+	memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (inner)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (inner)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool inner = shift > 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+#if 0
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static RT_NODE*
+RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+#endif
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+ RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
+
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old->shift == new->shift);
+#endif
+
+ if (parent == old)
+ {
+ /* Replace the root node with the new large node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_4 *n4;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->shift = shift;
+ node->count = 1;
+
+ n4 = (RT_NODE_INNER_4 *) node;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have the inner and leaf nodes needed for the given
+ * key-value pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ nodep = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is stored in *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Delete the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Delete the value corresponding to 'key' in the given leaf node.
+ *
+ * Return true if the key is found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/* Insert the child to the inner node */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Insert the value to the leaf node */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+	/* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* XXX: do we need to set a callback on exit to detach dsa? */
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to 'value'
+ * and return true. Returns false if the entry doesn't yet exist.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC nodep;
+ RT_PTR_LOCAL node;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ //Assert(tree->ctl->root);
+
+ nodep = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, nodep);
+ shift = parent->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ node = RT_PTR_GET_LOCAL(tree, nodep);
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_SET_EXTEND(tree, key, value, parent, nodep, node);
+ return false;
+ }
+
+ parent = node;
+ nodep = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set in *value_p, so it must
+ * not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
+}
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ return true;
+}
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value in *value_p, otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is constructed
+	 * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance inner node
+		 * iterators from level 1 upward until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+RT_SCOPE uint64
+RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ // XXX is this necessary?
+ Size total = sizeof(RT_RADIX_TREE);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+					/* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->ctl->num_keys,
+ tree->ctl->root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(128); i++)
+ {
+ fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ }
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->ctl->max_val, tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_LOCAL child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node, find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_size,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->ctl->root, 0, true);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+
+/* locally declared macros */
+#undef NODE_IS_LEAF
+#undef NODE_IS_EMPTY
+#undef VAR_NODE_HAS_FREE_SLOT
+#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_RADIX_TREE_MAGIC
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_4_FULL
+#undef RT_CLASS_32_PARTIAL
+#undef RT_CLASS_32_FULL
+#undef RT_CLASS_125_FULL
+#undef RT_CLASS_256
+#undef RT_KIND_MIN_SIZE_CLASS
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_NUM_ENTRIES
+#undef RT_DUMP
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_GROW_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..eb87866b90
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,106 @@
+/* TODO: shrink nodes */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..e4faf54d9d
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,316 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ RT_PTR_LOCAL newnode = NULL;
+ RT_PTR_ALLOC allocnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool inner = false;
+ Assert(NODE_IS_LEAF(node));
+#else
+ const bool inner = true;
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[idx] = value;
+#else
+ n4->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 4 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ new32 = (RT_NODE32_TYPE *) newnode;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+#endif
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
+ count, insertpos);
+#endif
+ }
+
+ n4->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[insertpos] = value;
+#else
+ n4->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.fanout == class32_min.fanout)
+ {
+ /* grow to the next size class of this kind */
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (RT_NODE32_TYPE *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_NODE_125_INVALID_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ new256 = (RT_NODE256_TYPE *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(128); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or
+ * replaced properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..0b8b68df6c
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,138 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value;
+
+ Assert(NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
+#endif
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..31e4978e4f
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,131 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value = 0;
+
+ Assert(NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+#endif
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n4->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[idx];
+#else
+ child = n4->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_NODE_125_INVALID_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ *value_p = value;
+#else
+ Assert(child_p != NULL);
+ *child_p = child;
+#endif
+
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 104386e674..c67f936880 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..61d842789d
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,631 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (if you
+ * do that, you might want to increase the number of values used in
+ * each of the tests, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in interleaved order like 1, children, 2, children - 1, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.0
v19-0009-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch (text/x-patch)
From ecc796c414a1dbd7c0c0df9bbcab0d922616b1ca Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 17 Jan 2023 17:20:37 +0700
Subject: [PATCH v19 9/9] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which is neither space efficient nor fast to look up, and which
was limited to 1GB in size.
This commit switches to TIDStore for this purpose. Since TIDStore,
backed by the radix tree, allocates memory incrementally, we get rid
of the 1GB limit.
Since we can no longer estimate in advance exactly how many TIDs fit
into a given amount of memory, this commit also renames the progress
columns max_dead_tuples and num_dead_tuples to max_dead_tuple_bytes
and num_dead_tuple_bytes and reports the progress in bytes.
Furthermore, since TIDStore uses the radix tree internally, the
minimum amount of memory required by TIDStore is 1MB, the initial
DSA segment size. Because of that, this change increases the minimum
maintenance_work_mem from 1MB to 2MB.
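For readers skimming the patch, here is a minimal sketch (illustration only, not part of the patch) of the dead-item flow that lazy vacuum follows with the new TIDStore: collect offsets per block during the first heap pass, run index vacuuming once the store reports it is full, then iterate over the store for the second heap pass. The tidstore_* functions and the TidStoreIterResult fields are the ones used in the diff below; the wrapper function itself is hypothetical.

static void
dead_items_flow_sketch(TidStore *dead_items, BlockNumber blkno,
                       OffsetNumber *deadoffsets, int lpdead_items)
{
    /* First heap pass: remember this block's LP_DEAD offsets */
    tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);

    /* Index vacuuming starts once the store reports it is full */
    if (tidstore_is_full(dead_items))
    {
        TidStoreIter *iter;
        TidStoreIterResult *result;

        /* ... each index's ambulkdelete() probes dead_items here ... */

        /* Second heap pass: replay the collected TIDs block by block */
        iter = tidstore_begin_iterate(dead_items);
        while ((result = tidstore_iterate_next(iter)) != NULL)
        {
            /*
             * result->blkno and result->offsets[0 .. num_offsets - 1]
             * identify the LP_DEAD items to set LP_UNUSED.
             */
        }
        tidstore_end_iterate(iter);

        /* Forget the vacuumed TIDs before resuming the heap scan */
        tidstore_reset(dead_items);
    }
}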
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 168 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +---------
src/backend/commands/vacuumparallel.c | 64 +++++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
15 files changed, 122 insertions(+), 242 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 358d2ff90f..6ce7ea9e35 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6840,10 +6840,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6851,10 +6851,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..41af676dfa 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -259,8 +260,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +827,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +908,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1037,11 +1038,18 @@ lazy_scan_heap(LVRelState *vacrel)
if (prunestate.has_lpdead_items)
{
Size freespace;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ result = tidstore_iterate_next(iter);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
+ Assert(!tidstore_iterate_next(iter));
+ tidstore_end_iterate(iter);
/* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ tidstore_reset(dead_items);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1086,7 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
}
/*
@@ -1249,7 +1257,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1893,23 +1901,15 @@ retry:
*/
if (lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
vacrel->lpdead_item_pages++;
prunestate->has_lpdead_items = true;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -2129,8 +2129,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2138,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2190,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2219,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2246,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2292,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2365,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2410,10 +2402,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2421,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2431,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2445,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2456,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,14 +2466,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2495,11 +2490,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2512,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2586,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3093,46 +3081,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3143,11 +3091,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3120,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3133,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d2a8c82900..fdc8a99bba 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1164,7 +1164,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127e..358ad25996 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,18 +2342,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2365,60 +2352,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..4c0ce4b7e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 196bece0a3..ff75fae88a 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -186,6 +186,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5025e80f89..edee8a2b2b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2301,7 +2301,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..220d89fff7 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e4162db613..40dda03088 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -204,6 +204,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index a969ae63eb..630869255f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT s.stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.39.0
v19-0006-Shared-memory-cleanups.patch (text/x-patch; charset=US-ASCII)
From afccfde982c95815b4a7b8dcef62ae5bc1d416d0 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 17 Jan 2023 16:50:38 +0700
Subject: [PATCH v19 6/9] Shared memory cleanups
---
src/include/lib/radixtree.h | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index ba326562d5..7c7b126b98 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1378,8 +1378,6 @@ RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
- /* XXX: do we need to set a callback on exit to detach dsa? */
-
return tree;
}
@@ -1412,8 +1410,7 @@ RT_FREE(RT_RADIX_TREE *tree)
* other backends access the memory formerly occupied by this radix tree.
*/
tree->ctl->magic = 0;
- dsa_free(tree->dsa, tree->ctl->handle); // XXX
- //dsa_detach(tree->dsa);
+ dsa_free(tree->dsa, tree->ctl->handle);
#else
pfree(tree->ctl);
@@ -1452,8 +1449,6 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
if (key > tree->ctl->max_val)
RT_EXTEND(tree, key);
- //Assert(tree->ctl->root);
-
nodep = tree->ctl->root;
parent = RT_PTR_GET_LOCAL(tree, nodep);
shift = parent->shift;
--
2.39.0
v19-0007-Make-RT_DELETE-optional.patch (text/x-patch; charset=US-ASCII)
From 88e0f6202959fa1a872eacb01c0e24cb27ae66d4 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 17 Jan 2023 17:34:28 +0700
Subject: [PATCH v19 7/9] Make RT_DELETE optional
To prevent compiler warnings in TIDStore
---
src/include/lib/radixtree.h | 16 +++++++++++++++-
src/test/modules/test_radixtree/test_radixtree.c | 1 +
2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 7c7b126b98..c2df8e882e 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -58,7 +58,6 @@
* RT_GET_HANDLE - Return the handle of the radix tree
* RT_SEARCH - Search a key-value pair
* RT_SET - Set a key-value pair
- * RT_DELETE - Delete a key-value pair
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
* RT_ITERATE_NEXT - Return next key-value pair, if any
* RT_END_ITER - End iteration
@@ -70,6 +69,12 @@
* RT_ITERATE_NEXT() ensures returning key-value pairs in the ascending
* order of the key.
*
+ * Optional Interface
+ * ------------------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined only if RT_USE_DELETE is defined
+ *
+ *
* Copyright (c) 2022, PostgreSQL Global Development Group
*
* IDENTIFICATION
@@ -106,7 +111,9 @@
#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
#define RT_DUMP RT_MAKE_NAME(dump)
#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
@@ -213,7 +220,9 @@ RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+#ifdef RT_USE_DELETE
RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
@@ -1264,6 +1273,7 @@ RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
#undef RT_NODE_LEVEL_LEAF
}
+#ifdef RT_USE_DELETE
/*
* Search for the child pointer corresponding to 'key' in the given node.
*
@@ -1289,6 +1299,7 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
#include "lib/radixtree_delete_impl.h"
#undef RT_NODE_LEVEL_LEAF
}
+#endif
/* Insert the child to the inner node */
static bool
@@ -1523,6 +1534,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
return RT_NODE_SEARCH_LEAF(node, key, value_p);
}
+#ifdef RT_USE_DELETE
/*
* Delete the given key from the radix tree. Return true if the key is found (and
* deleted), otherwise do nothing and return false.
@@ -1609,6 +1621,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
return true;
}
+#endif
static inline void
RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
@@ -2166,6 +2179,7 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_BEGIN_ITERATE
#undef RT_ITERATE_NEXT
#undef RT_END_ITERATE
+#undef RT_USE_DELETE
#undef RT_DELETE
#undef RT_MEMORY_USAGE
#undef RT_DUMP
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 076173f628..f01d4dd733 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -104,6 +104,7 @@ static const test_spec test_specs[] = {
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+#define RT_USE_DELETE
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
#include "lib/radixtree.h"
--
2.39.0
v19-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (text/x-patch; charset=US-ASCII)
From 5205bba3e7d4542fe350fd3606acb78caace866d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v19 8/9] Add TIDStore, to store sets of TIDs (ItemPointerData)
efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
a 64-bit value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 587 ++++++++++++++++++
src/include/access/tidstore.h | 49 ++
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 34 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../test_tidstore/test_tidstore.control | 4 +
10 files changed, 727 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..4170d13b3c
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,587 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, Tids are encoded as a pair of a 64-bit key and a 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing DSA area
+ * to tidstore_create(). Other backends can attach to the shared TidStore by
+ * tidstore_attach(). It can support concurrent updates but only one process
+ * is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "lib/radixtree.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of 64-bit
+ * key and 64-bit value. First, we construct 64-bit unsigned integer key that
+ * combines the block number and the offset number. The lowest 11 bits represent
+ * the offset number, and the next 32 bits are block number. That is, only 43
+ * bits are used:
+ *
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys could help keep the radix tree shallow.
+ *
+ * XXX: If we want to support other table AMs that want to use the full range
+ * of possible offset numbers, we'll need to change this.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits, and
+ * the remaining 37 bits are used as the key:
+ *
+ * value = bitmap representation of XXXXXX
+ * key = XXXXXYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYuu
+ *
+ * The maximum height of the radix tree is 5.
+ *
+ * XXX: if we want to support non-heap table AM, we need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.
+ */
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6
+
+/*
+ * Memory consumption depends on the number of Tids stored, but also on the
+ * distribution of them and how the radix tree stores them. The maximum bytes
+ * that a TidStore can use is specified by the max_bytes in tidstore_create().
+ *
+ * In non-shared cases, the radix tree uses a slab allocator for each kind of
+ * node class. The most memory consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is that we allocate the
+ * largest radix tree node in a new slab block, which is approximately 70kB.
+ * Therefore, we deduct 70kB from the maximum bytes.
+ *
+ * In shared cases, DSA allocates memory in segments whose sizes follow a
+ * geometric series that approximately doubles the total DSA size. So we
+ * limit the maximum bytes for a TidStore to 75% of max_bytes. The 75%
+ * threshold works well when the maximum bytes is a power of 2. In other
+ * cases, we use a 60% threshold.
+ */
+#define TIDSTORE_MEMORY_DEDUCT_BYTES (1024L * 70) /* 70kB */
+
+/* Get block number from the key */
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ /*
+ * 'num_tids' is the number of Tids stored so far. 'max_bytes' is the maximum
+ * bytes a TidStore can use. These two fields are commonly used in both
+ * non-shared case and shared case.
+ */
+ uint32 num_tids;
+ uint64 max_bytes;
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * We calculate the maximum bytes for the TidStore in different ways
+ * for the non-shared case and the shared case. Please refer to the comment
+ * above TIDSTORE_MEMORY_DEDUCT_BYTES for details.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - TIDSTORE_MEMORY_DEDUCT_BYTES;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backends must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming error where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ /*
+ * Free the current radix tree, and return allocated DSM segments
+ * to the operating system, if necessary. */
+ if (TidStoreIsShared(ts))
+ {
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, val);
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/*
+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are ordered in ascending order.
+ */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 key;
+ uint64 val = 0;
+ ItemPointerData tid;
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ key = tid_to_key_off(&tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ /* insert the key-value */
+ tidstore_insert_kv(ts, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= UINT64CONST(1) << off;
+ }
+
+ if (last_key != PG_UINT64_MAX)
+ {
+ /* insert the key-value */
+ tidstore_insert_kv(ts, last_key, val);
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+}
+
+/* Return true if the given Tid is present in TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = TidStoreIsShared(ts) ?
+ shared_rt_search(ts->tree.shared, key, &val) :
+ local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+ iter->result.blkno = InvalidBlockNumber;
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate */
+ if (ts->control->num_tids == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+uint64
+tidstore_num_tids(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+uint64
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return (uint64) sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return (uint64) sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a Tid to key and val.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..4bffdf0920
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "lib/radixtree.h"
+#include "storage/itemptr.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually not fully used */
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern uint64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern uint64 tidstore_max_memory(TidStore *ts);
+extern uint64 tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..1973963440
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..3365b073a4
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.0
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
+ * Add Tids on a block to TidStore. The caller must ensure the offset numbers
+ * in 'offsets' are ordered in ascending order.
Must? What happens otherwise?
It ends up missing TIDs by overwriting the same key with different
values. Is it better to have a bool argument, say need_sort, to sort
the given array if the caller wants?
Now that I've studied it some more, I see what's happening: We need all
bits set in the "value" before we insert it, since it would be too
expensive to retrieve the current value, add one bit, and put it back.
Also, as a consequence of the encoding, part of the tid is in the key, and
part in the value. It makes more sense now, but it needs more than zero
comments.
As for the order, I don't think it's the responsibility of the caller to
guess if it needs sorting -- if unordered offsets lead to data loss, this
function needs to take care of it.
+ uint64 last_key = PG_UINT64_MAX;
I'm having some difficulty understanding this sentinel and how it's
used.
Will improve the logic.
Part of the problem is the English language: "last" can mean "previous" or
"at the end", so maybe some name changes would help.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Jan 17, 2023 at 8:06 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Attached is an update that mostly has the modest goal of getting CI green again. v19-0003 has squashed the entire radix tree template from previously. I've kept out the perf test module for now -- still needs updating.
[05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
external symbol pg_popcount64
[05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
fatal error LNK1120: 1 unresolved externals
Yeah, I'm not sure what's causing that. Since that comes from a debugging function, we could work around it, but it would be nice to understand why, so I'll probably have to experiment on my CI repo.
I'm still confused by this error, because it only occurs in the test module. I successfully built with just 0002 in CI and elsewhere, where bmw_* symbols resolve just fine on all platforms. I've worked around the error in v19-0004 by using the general-purpose pg_popcount() function. We only need to count bits in assert builds, so it doesn't matter a whole lot.
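For reference, the workaround amounts to something like the following hypothetical assert-only helper (the helper name is made up; only the idea of going through the byte-wise pg_popcount() from port/pg_bitutils.h instead of pg_popcount64() is from v19-0004):

#include "postgres.h"
#include "port/pg_bitutils.h"

/*
 * Hypothetical helper: count set bits in a 64-bit word without referencing
 * pg_popcount64, which failed to link in the test module on Windows. It is
 * only used under assertions, so speed is not critical.
 */
static inline int
count_set_bits(uint64 word)
{
	return (int) pg_popcount((const char *) &word, sizeof(word));
}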
I spent today investigating this issue, I found out that on Windows,
libpgport_src.a is not linked when building codes outside of
src/backend unless explicitly linking it. It's not a problem on Linux
etc. but the linker raises a fatal error on Windows. I'm not sure the
right way to fix it but the attached patch resolved the issue on
cfbot. Since it seems not to be related to 0002 patch but maybe the
designed behavior or a problem in meson. We can discuss it on a
separate thread.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
link_pgport_src.patch (application/octet-stream)
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
index f96bf159d6..3f444ac05e 100644
--- a/src/test/modules/test_radixtree/meson.build
+++ b/src/test/modules/test_radixtree/meson.build
@@ -12,6 +12,7 @@ endif
test_radixtree = shared_module('test_radixtree',
test_radixtree_sources,
+ link_with: [pgport_srv],
kwargs: pg_mod_args,
)
testprep_targets += test_radixtree
On Tue, Jan 17, 2023 at 8:06 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Attached is an update that mostly has the modest goal of getting CI green again. v19-0003 has squashed the entire radix tree template from previously. I've kept out the perf test module for now -- still needs updating.
[05:44:11.819] test_radixtree.c.obj : error LNK2001: unresolved
external symbol pg_popcount64
[05:44:11.819] src\test\modules\test_radixtree\test_radixtree.dll :
fatal error LNK1120: 1 unresolved externals
Yeah, I'm not sure what's causing that. Since that comes from a debugging function, we could work around it, but it would be nice to understand why, so I'll probably have to experiment on my CI repo.
I'm still confused by this error, because it only occurs in the test module. I successfully built with just 0002 in CI and elsewhere, where bmw_* symbols resolve just fine on all platforms. I've worked around the error in v19-0004 by using the general-purpose pg_popcount() function. We only need to count bits in assert builds, so it doesn't matter a whole lot.
+ /* XXX: do we need to set a callback on exit to detach dsa? */
In the current shared radix tree design, it is the caller's responsibility
to create (or attach to) a DSA area and pass it to RT_CREATE() or
RT_ATTACH(). That enables us to use one DSA not only for the radix tree
but also for other data, which is more flexible. So the caller needs to
detach from the DSA somehow, and I think we don't need to set a
callback here for that.
---
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);
Similar to above, I think we should not detach from the DSA area here.
Given that the DSA area used by the radix tree could be used also by
other data, I think that in RT_FREE() we need to free each radix tree
node allocated in DSA. In lazy vacuum, we check the memory usage
instead of the number of TIDs and need to reset the TidStore after an
index scan. So it does RT_FREE() and dsa_trim() to return DSM segments
to the OS. I've implemented rt_free_recurse() for this purpose in the
v15 version patch.
--
- Assert(tree->root);
+ //Assert(tree->ctl->root);
I think we don't need this assertion in the first place. We check it
at the beginning of the function.
I've removed these in v19-0006.
That sounds like a good idea. It's also worth wondering if we even need RT_NUM_ENTRIES at all, since the caller is capable of keeping track of that if necessary. It's also misnamed, since it's concerned with the number of keys. The vacuum case cares about the number of TIDs, and not number of (encoded) keys. Even if we ever (say) changed the key to blocknumber and value to Bitmapset, the number of keys might not be interesting.
Right. In fact, TidStore doesn't use RT_NUM_ENTRIES.
I've moved it to the test module, which uses it extensively. There, it's clearer what the name is for, so I didn't change it.
It sounds like we should at least make the delete functionality optional. (Side note on optional functions: if an implementation didn't care about iteration or its order, we could optimize insertion into linear nodes)
Agreed.
Done in v19-0007.
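So each instantiation now opts in explicitly; a sketch of the template
usage (RT_USE_DELETE is the switch named in the template header, and
the rest mirrors what tidstore.c already does, so this is illustration
rather than new code):

/* a local (non-shared) tree that also wants deletion */
#define RT_PREFIX local_rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE			/* makes local_rt_delete() available */
#include "lib/radixtree.h"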
v19-0009 is just a rebase over some more vacuum cleanups.
Thank you for updating the patches!
I've attached new version patches. There is no change from the v19
patches for 0001 through 0006, and the 0004, 0005 and 0006 patches look
good to me; we can merge them into the 0003 patch.
The 0007 patch fixes functions that are defined when RT_DEBUG is set.
These functions might be removed before commit, but they are useful at
least during development. The 0008 patch fixes a bug in
RT_CHUNK_VALUES_ARRAY_SHIFT() and adds tests for it. The 0009 patch
fixes the cfbot issue by linking pgport_srv. The 0010 patch adds
RT_FREE_RECURSE() to free all radix tree nodes allocated in DSA. The
0011 patch updates the copyright etc. The 0012 and 0013 patches are
updated versions that incorporate all the comments I've gotten so far.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v20-0013-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch (application/octet-stream)
From 33f4c5ceed5659224e084549a608414f0f1495d4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 17 Jan 2023 17:20:37 +0700
Subject: [PATCH v20 13/13] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which is not space efficient and slow to lookup. Also, we had
the 1GB limit on its size.
This changes it to use TIDStore for this purpose. Since the TIDStore,
backed by the radix tree, incrementally allocates memory, we get rid
of the 1GB limit.
Also, since we are no longer able to exactly estimate the maximum
number of TIDs that can be stored based on the amount of memory, this
also renames the progress columns max_dead_tuples and num_dead_tuples
and reports the progress information in bytes.
Furthermore, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by TIDStore is 1MB, which is the
initial DSA segment size. Due to that, this change increases the
minimum maintenance_work_mem from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 210 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +-------
src/backend/commands/vacuumparallel.c | 64 ++++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
15 files changed, 138 insertions(+), 268 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c9bc091045..68b13de735 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6844,10 +6844,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6855,10 +6855,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..90f8a5e087 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,17 +221,21 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected LP_DEAD items including existing LP_DEAD items */
+ int lpdead_items;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies !HAS_LPDEAD_ITEMS(), but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
bool all_frozen; /* provided all_visible is also true */
TransactionId visibility_cutoff_xid; /* For recovery conflicts */
} LVPagePruneState;
+#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)
/* Struct for saving and restoring vacuum error information. */
typedef struct LVSavedErrInfo
@@ -259,8 +264,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +831,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +912,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1018,7 +1023,7 @@ lazy_scan_heap(LVRelState *vacrel)
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || !HAS_LPDEAD_ITEMS(prunestate));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1039,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (HAS_LPDEAD_ITEMS(prunestate))
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.lpdead_items, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1081,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (HAS_LPDEAD_ITEMS(prunestate))
+ {
+ /* Save details of the LP_DEAD items from the page */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.lpdead_items);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1157,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if (HAS_LPDEAD_ITEMS(prunestate) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1205,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if (HAS_LPDEAD_ITEMS(prunestate) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1261,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1543,13 +1555,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1581,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1589,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->lpdead_items; prunestate->lpdead_items's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1602,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->lpdead_items = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1647,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->lpdead_items++] = offnum;
continue;
}
@@ -1875,7 +1884,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->lpdead_items == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1897,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1918,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->lpdead_items;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -2129,8 +2119,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2128,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2180,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2209,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2236,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2282,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2355,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2410,10 +2392,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2411,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2421,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2435,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2446,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,14 +2456,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2495,11 +2480,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2502,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2576,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3093,46 +3071,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3143,11 +3081,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3110,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3123,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d2a8c82900..fdc8a99bba 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1164,7 +1164,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127e..358ad25996 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,18 +2342,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2365,60 +2352,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..4c0ce4b7e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index cbfe329591..4c35af3412 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -188,6 +188,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 5025e80f89..edee8a2b2b 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2301,7 +2301,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..220d89fff7 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 7b7663e2e1..c9b4741e32 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -205,6 +205,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index a969ae63eb..630869255f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT s.stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
v20-0010-Free-all-radix-tree-node-recursively.patch (application/octet-stream)
From cc2a07008e0eedef43c67c8ef9b55560ce2858b6 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 19 Jan 2023 14:50:37 +0900
Subject: [PATCH v20 10/13] Free all radix tree node recursively.
---
src/include/lib/radixtree.h | 78 +++++++++++++++++++++++++++++++++++++
1 file changed, 78 insertions(+)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 4ed463ba51..fe94335d53 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -127,6 +127,7 @@
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
@@ -1408,6 +1409,78 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
return tree->ctl->handle;
}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ for (int i = 0; i < n4->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
#endif
/*
@@ -1419,6 +1492,10 @@ RT_FREE(RT_RADIX_TREE *tree)
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
/*
* Vandalize the control block to help catch programming error where
* other backends access the memory formerly occupied by this radix tree.
@@ -2197,6 +2274,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
#undef RT_EXTEND
#undef RT_SET_EXTEND
#undef RT_GROW_NODE_KIND
--
2.31.1
v20-0011-Update-Copyright-and-Identification.patch (application/octet-stream)
From d458feb13ffa693e635e68592339e4be837f2b2b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 19 Jan 2023 23:49:54 +0900
Subject: [PATCH v20 11/13] Update Copyright and Identification.
---
src/include/lib/radixtree.h | 6 +++---
src/test/modules/test_radixtree/meson.build | 2 +-
src/test/modules/test_radixtree/test_radixtree.c | 2 +-
3 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index fe94335d53..97cccdc9ca 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1,6 +1,6 @@
/*-------------------------------------------------------------------------
*
- * radixtree.c
+ * radixtree.h
* Implementation for adaptive radix tree.
*
* This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
@@ -75,10 +75,10 @@
* RT_DELETE - Delete a key-value pair. Declared/define if RT_USE_DELETE is defined
*
*
- * Copyright (c) 2022, PostgreSQL Global Development Group
+ * Copyright (c) 2023, PostgreSQL Global Development Group
*
* IDENTIFICATION
- * src/backend/lib/radixtree.c
+ * src/include/lib/radixtree.h
*
*-------------------------------------------------------------------------
*/
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
index 72c91d0b7a..6add06bbdb 100644
--- a/src/test/modules/test_radixtree/meson.build
+++ b/src/test/modules/test_radixtree/meson.build
@@ -7,7 +7,7 @@ test_radixtree_sources = files(
if host_system == 'windows'
test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
'--NAME', 'test_radixtree',
- '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+ '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
endif
test_radixtree = shared_module('test_radixtree',
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 4b250be3f9..d8323f587f 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -3,7 +3,7 @@
* test_radixtree.c
* Test radixtree set data structure.
*
- * Copyright (c) 2022, PostgreSQL Global Development Group
+ * Copyright (c) 2023, PostgreSQL Global Development Group
*
* IDENTIFICATION
* src/test/modules/test_radixtree/test_radixtree.c
--
2.31.1
v20-0009-add-link-to-pgport_srv-in-test_radixtree.patch (application/octet-stream)
From cc9e2b8b0614e955231f45bfcedd8cfee1372683 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 19 Jan 2023 09:53:48 +0900
Subject: [PATCH v20 09/13] add link to pgport_srv in test_radixtree.
---
src/test/modules/test_radixtree/meson.build | 1 +
1 file changed, 1 insertion(+)
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
index f96bf159d6..72c91d0b7a 100644
--- a/src/test/modules/test_radixtree/meson.build
+++ b/src/test/modules/test_radixtree/meson.build
@@ -12,6 +12,7 @@ endif
test_radixtree = shared_module('test_radixtree',
test_radixtree_sources,
+ link_with: pgport_srv,
kwargs: pg_mod_args,
)
testprep_targets += test_radixtree
--
2.31.1
v20-0012-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (application/octet-stream)
From 21d455583898f55e2aa24419b35e4ac34cde4377 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v20 12/13] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 624 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 189 ++++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 963 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 358d2ff90f..c9bc091045 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2180,6 +2180,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..fa55793227
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,624 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a Tid is encoded as a pair of 64-bit key and 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * with tidstore_attach(). It supports concurrent updates, but only one process
+ * is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of 64-bit
+ * key and 64-bit value. First, we construct 64-bit unsigned integer key that
+ * combines the block number and the offset number. The lowest 11 bits represent
+ * the offset number, and the next 32 bits are block number. That is, only 43
+ * bits are used:
+ *
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys could help keep the radix tree shallow.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits, and
+ * the rest 37 bits are used as the key:
+ *
+ * value = bitmap representation of XXXXXX
+ * key = XXXXXYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYuu
+ *
+ * The maximum height of the radix tree is 5.
+ *
+ * XXX: if we want to support non-heap table AM that want to use the full
+ * range of possible offset numbers, we'll need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.
+ */
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6
+
+/*
+ * Memory consumption depends on the number of Tids stored, but also on their
+ * distribution, how the radix tree stores them, and the memory management
+ * that backs the radix tree. The maximum bytes that a TidStore can
+ * use is specified by the max_bytes in tidstore_create(). We want the total
+ * amount of memory consumption not to exceed the max_bytes.
+ *
+ * In non-shared cases, the radix tree uses slab allocators for each kind of
+ * node class. The most memory consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is that we allocate the
+ * largest radix tree node in a new slab block, which is approximately 70kB.
+ * Therefore, we deduct 70kB from the maximum bytes.
+ *
+ * In shared cases, DSA allocates the memory segments big enough to follow
+ * a geometric series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases the segment
+ * size, and the simulation showed that the 75% threshold for the maximum
+ * bytes works well when it is a power of 2, and the 60% threshold
+ * works for other cases.
+ */
+#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6
+
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+#define BLKNO_GET_KEY(blkno) \
+ (((uint64) (blkno) << (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ /*
+ * 'num_tids' is the number of Tids stored so far. 'max_bytes' is the maximum
+ * bytes a TidStore can use. These two fields are commonly used in both
+ * non-shared case and shared case.
+ */
+ uint64 num_tids;
+ uint64 max_bytes;
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+
+ /* protect the shared fields */
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0)
+ ? TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2
+ : TIDSTORE_SHARED_MAX_MEMORY_RATIO;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes =(uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backends must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+		 * Vandalize the control block to help catch programming errors where
+		 * other backends access the memory formerly occupied by this
+		 * TidStore.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (TidStoreIsShared(ts))
+ {
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ {
+ /*
+ * Since the shared radix tree supports concurrent insert,
+ * we don't need to acquire the lock.
+ */
+ shared_rt_set(ts->tree.shared, key, val);
+ }
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+#define NUM_KEYS_PER_BLOCK (1 << (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS))
+ ItemPointerData tid;
+ uint64 key_base;
+ uint64 values[NUM_KEYS_PER_BLOCK] = {0};
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+ key_base = BLKNO_GET_KEY(blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 key;
+ uint32 off;
+ int idx;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ /* encode the Tid to key and val */
+ key = tid_to_key_off(&tid, &off);
+
+ idx = key - key_base;
+ Assert(idx >= 0 && idx < NUM_KEYS_PER_BLOCK);
+
+ values[idx] |= UINT64CONST(1) << off;
+ }
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i < NUM_KEYS_PER_BLOCK; i++)
+ {
+ if (values[i])
+ {
+ uint64 key = key_base + i;
+
+ tidstore_insert_kv(ts, key, values[i]);
+ }
+ }
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+}
+
+/* Return true if the given Tid is present in TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = TidStoreIsShared(ts) ?
+ shared_rt_search(ts->tree.shared, key, &val) :
+ local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+ iter->result.blkno = InvalidBlockNumber;
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+	/* If the TidStore is empty, there is nothing to iterate */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+uint64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+	if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+uint64
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return (uint64) sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+		return (uint64) sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a Tid into a radix tree key and the bit offset within its value.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 196bece0a3..cbfe329591 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..ec3d9f87f5
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+	OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually not fully used */
+ int num_offsets;
+} TidStoreIterResult;
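+
+/*
+ * Expected usage (sketch): tidstore_begin_iterate(), then repeatedly call
+ * tidstore_iterate_next() until it returns NULL, then tidstore_end_iterate().
+ * Each non-NULL result describes one block and its sorted offset numbers.
+ */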
+
+extern TidStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern uint64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern uint64 tidstore_max_memory(TidStore *ts);
+extern uint64 tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e4162db613..7b7663e2e1 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..5d38387450
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,189 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(void)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 11
+#define IS_POWER_OF_TWO(x) (((x) & ((x) - 1)) == 0)
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS] = {
+ 1 << 5, 1 << 6, 1 << 7, 1 << 8, 1 << 9,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4,
+ 1 << 10
+ };
+ OffsetNumber offs_sorted[TEST_TIDSTORE_NUM_OFFSETS] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4,
+ 1 << 5, 1 << 6, 1 << 7, 1 << 8, 1 << 9,
+ 1 << 10
+ };
+ int blk_idx;
+
+ elog(NOTICE, "testing basic operations");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, IS_POWER_OF_TWO(off));
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, IS_POWER_OF_TWO(off));
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs_sorted[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno,
+ offs_sorted[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+		elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+ test_basic();
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
v20-0005-Shared-memory-cleanups.patch
From 071f8c13f5eb18d2d7449dfe5457d27a753b0528 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 17 Jan 2023 16:50:38 +0700
Subject: [PATCH v20 05/13] Shared memory cleanups
---
src/include/lib/radixtree.h | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index a78079b896..345b37e5fb 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1378,8 +1378,6 @@ RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
- /* XXX: do we need to set a callback on exit to detach dsa? */
-
return tree;
}
@@ -1412,8 +1410,7 @@ RT_FREE(RT_RADIX_TREE *tree)
* other backends access the memory formerly occupied by this radix tree.
*/
tree->ctl->magic = 0;
- dsa_free(tree->dsa, tree->ctl->handle); // XXX
- //dsa_detach(tree->dsa);
+ dsa_free(tree->dsa, tree->ctl->handle);
#else
pfree(tree->ctl);
@@ -1452,8 +1449,6 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
if (key > tree->ctl->max_val)
RT_EXTEND(tree, key);
- //Assert(tree->ctl->root);
-
nodep = tree->ctl->root;
parent = RT_PTR_GET_LOCAL(tree, nodep);
shift = parent->shift;
--
2.31.1
v20-0006-Make-RT_DELETE-optional.patch
From 5d225cecf001837617b2eab36c96fecf2deb6af7 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 17 Jan 2023 17:34:28 +0700
Subject: [PATCH v20 06/13] Make RT_DELETE optional
To prevent compiler warnings in TIDStore
---
src/include/lib/radixtree.h | 16 +++++++++++++++-
src/test/modules/test_radixtree/test_radixtree.c | 1 +
2 files changed, 16 insertions(+), 1 deletion(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 345b37e5fb..5bdfa74f72 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -58,7 +58,6 @@
* RT_GET_HANDLE - Return the handle of the radix tree
* RT_SEARCH - Search a key-value pair
* RT_SET - Set a key-value pair
- * RT_DELETE - Delete a key-value pair
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
* RT_ITERATE_NEXT - Return next key-value pair, if any
* RT_END_ITER - End iteration
@@ -70,6 +69,12 @@
* RT_ITERATE_NEXT() ensures returning key-value pairs in the ascending
* order of the key.
*
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined only if RT_USE_DELETE is defined
+ *
+ *
* Copyright (c) 2022, PostgreSQL Global Development Group
*
* IDENTIFICATION
@@ -106,7 +111,9 @@
#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
#define RT_DUMP RT_MAKE_NAME(dump)
#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
@@ -213,7 +220,9 @@ RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+#ifdef RT_USE_DELETE
RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
@@ -1264,6 +1273,7 @@ RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
#undef RT_NODE_LEVEL_LEAF
}
+#ifdef RT_USE_DELETE
/*
* Search for the child pointer corresponding to 'key' in the given node.
*
@@ -1289,6 +1299,7 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
#include "lib/radixtree_delete_impl.h"
#undef RT_NODE_LEVEL_LEAF
}
+#endif
/* Insert the child to the inner node */
static bool
@@ -1523,6 +1534,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
return RT_NODE_SEARCH_LEAF(node, key, value_p);
}
+#ifdef RT_USE_DELETE
/*
* Delete the given key from the radix tree. Return true if the key is found (and
* deleted), otherwise do nothing and return false.
@@ -1609,6 +1621,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
return true;
}
+#endif
static inline void
RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
@@ -2168,6 +2181,7 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_BEGIN_ITERATE
#undef RT_ITERATE_NEXT
#undef RT_END_ITERATE
+#undef RT_USE_DELETE
#undef RT_DELETE
#undef RT_MEMORY_USAGE
#undef RT_DUMP
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 076173f628..f01d4dd733 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -104,6 +104,7 @@ static const test_spec test_specs[] = {
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+#define RT_USE_DELETE
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
#include "lib/radixtree.h"
--
2.31.1
v20-0008-Fix-bug-in-RT_CHUNK_VALUES_ARRAY_SHIFT.patch
From 58e98149a77ae23548c8d0fb3f3d229496ea1d9e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 19 Jan 2023 23:49:33 +0900
Subject: [PATCH v20 08/13] Fix bug in RT_CHUNK_VALUES_ARRAY_SHIFT().
---
src/include/lib/radixtree.h | 2 +-
src/test/modules/test_radixtree/test_radixtree.c | 12 ++++++++++++
2 files changed, 13 insertions(+), 1 deletion(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index b9e09f5761..4ed463ba51 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -830,7 +830,7 @@ static inline void
RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64 *) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
}
/* Delete the element at 'idx' */
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index f01d4dd733..4b250be3f9 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -206,6 +206,18 @@ test_basic(int children, bool test_inner)
elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
}
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ uint64 value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, keys[i]);
+ }
+
/* update keys */
for (int i = 0; i < children; i++)
{
--
2.31.1
v20-0007-Fix-RT_DEBUG-functions.patch
From 5c27ee115383d257d8d4f2280ad200e40dc36ceb Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 19 Jan 2023 23:29:16 +0900
Subject: [PATCH v20 07/13] Fix RT_DEBUG functions.
---
src/include/lib/radixtree.h | 30 +++++++++++++++++-------------
1 file changed, 17 insertions(+), 13 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 5bdfa74f72..b9e09f5761 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -115,9 +115,12 @@
#define RT_DELETE RT_MAKE_NAME(delete)
#endif
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
#define RT_STATS RT_MAKE_NAME(stats)
+#endif
/* internal helper functions (no externally visible prototypes) */
#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
@@ -1876,8 +1879,8 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
/***************** DEBUG FUNCTIONS *****************/
#ifdef RT_DEBUG
-void
-rt_stats(RT_RADIX_TREE *tree)
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
{
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
tree->ctl->num_keys,
@@ -1890,7 +1893,7 @@ rt_stats(RT_RADIX_TREE *tree)
}
static void
-rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
char space[125] = {0};
@@ -1926,7 +1929,7 @@ rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
space, n4->base.chunks[i]);
if (recurse)
- rt_dump_node(n4->children[i], level + 1, recurse);
+ RT_DUMP_NODE(n4->children[i], level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -1953,7 +1956,7 @@ rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
if (recurse)
{
- rt_dump_node(n32->children[i], level + 1, recurse);
+ RT_DUMP_NODE(n32->children[i], level + 1, recurse);
}
else
fprintf(stderr, "\n");
@@ -2005,7 +2008,7 @@ rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(RT_NODE_INNER_125_GET_CHILD(n125, i),
+ RT_DUMP_NODE(RT_NODE_INNER_125_GET_CHILD(n125, i),
level + 1, recurse);
else
fprintf(stderr, "\n");
@@ -2038,7 +2041,7 @@ rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
space, i);
if (recurse)
- rt_dump_node(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
+ RT_DUMP_NODE(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
recurse);
else
fprintf(stderr, "\n");
@@ -2049,8 +2052,8 @@ rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
}
}
-void
-rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
{
RT_PTR_LOCAL node;
int shift;
@@ -2079,7 +2082,7 @@ rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
{
RT_PTR_LOCAL child;
- rt_dump_node(node, level, false);
+ RT_DUMP_NODE(node, level, false);
if (NODE_IS_LEAF(node))
{
@@ -2100,8 +2103,8 @@ rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
}
}
-void
-rt_dump(RT_RADIX_TREE *tree)
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
{
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
@@ -2119,7 +2122,7 @@ rt_dump(RT_RADIX_TREE *tree)
return;
}
- rt_dump_node(tree->ctl->root, 0, true);
+ RT_DUMP_NODE(tree->ctl->root, 0, true);
}
#endif
@@ -2185,6 +2188,7 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_DELETE
#undef RT_MEMORY_USAGE
#undef RT_DUMP
+#undef RT_DUMP_NODE
#undef RT_DUMP_SEARCH
#undef RT_STATS
--
2.31.1
v20-0004-Remove-RT_NUM_ENTRIES.patch
From 97a647cd9486f58b9186e6dc46fd0afdf474dfd9 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 17 Jan 2023 16:38:09 +0700
Subject: [PATCH v20 04/13] Remove RT_NUM_ENTRIES
This is not expected to be used everywhere, and is very simple
to implement, so move definition to test module where it is
used extensively.
---
src/include/lib/radixtree.h | 13 -------------
src/test/modules/test_radixtree/test_radixtree.c | 9 +++++++++
2 files changed, 9 insertions(+), 13 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 9f8bed09f7..a78079b896 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -63,7 +63,6 @@
* RT_ITERATE_NEXT - Return next key-value pair, if any
* RT_END_ITER - End iteration
* RT_MEMORY_USAGE - Get the memory usage
- * RT_NUM_ENTRIES - Get the number of key-value pairs
*
* RT_CREATE() creates an empty radix tree in the given memory context
* and memory contexts for all kinds of radix tree node under the memory context.
@@ -109,7 +108,6 @@
#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
#define RT_DELETE RT_MAKE_NAME(delete)
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
-#define RT_NUM_ENTRIES RT_MAKE_NAME(num_entries)
#define RT_DUMP RT_MAKE_NAME(dump)
#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
#define RT_STATS RT_MAKE_NAME(stats)
@@ -222,7 +220,6 @@ RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
-RT_SCOPE uint64 RT_NUM_ENTRIES(RT_RADIX_TREE *tree);
#ifdef RT_DEBUG
RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
@@ -1773,15 +1770,6 @@ RT_END_ITERATE(RT_ITER *iter)
pfree(iter);
}
-/*
- * Return the number of keys in the radix tree.
- */
-RT_SCOPE uint64
-RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
-{
- return tree->ctl->num_keys;
-}
-
/*
* Return the statistics of the amount of memory used by the radix tree.
*/
@@ -2187,7 +2175,6 @@ rt_dump(RT_RADIX_TREE *tree)
#undef RT_END_ITERATE
#undef RT_DELETE
#undef RT_MEMORY_USAGE
-#undef RT_NUM_ENTRIES
#undef RT_DUMP
#undef RT_DUMP_SEARCH
#undef RT_STATS
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 61d842789d..076173f628 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -109,6 +109,15 @@ static const test_spec test_specs[] = {
#include "lib/radixtree.h"
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(test_radixtree);
--
2.31.1
v20-0003-Add-radixtree-template.patch
From a81afc05faabfc4f2d49cb93cf5867032100a535 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v20 03/13] Add radixtree template
The only thing configurable at this point is function scope,
prefix, and local/shared memory.
The key and value type are still hard-coded to uint64.
To make this more useful, at least value type should be
configurable.
It might be good at some point to offer a different tree type,
e.g. "single-value leaves" to allow for variable length keys
and values, giving full flexibility to developers.
TODO: Reducing the smallest node to 3 members will
eliminate padding and only take up 32 bytes for
inner nodes.
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2243 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 106 +
src/include/lib/radixtree_insert_impl.h | 316 +++
src/include/lib/radixtree_iter_impl.h | 138 +
src/include/lib/radixtree_search_impl.h | 131 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 34 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 631 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 3715 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 604b702a91..50f0aae3ab 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..9f8bed09f7
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2243 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes (shift > 0)
+ * store pointers to their child nodes as values, whereas leaf nodes
+ * (shift == 0) store the 64-bit unsigned integers specified by the user as
+ * values. The paper refers to this technique as "Multi-value leaves". We chose
+ * it to avoid an additional pointer traversal, and it is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants, one for inner nodes and
+ * one for leaf nodes, so there is some duplicated code. While this sometimes
+ * makes code maintenance tricky, it reduces branch prediction misses when
+ * judging whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ *
+ * Optional parameters:
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_DELETE - Delete a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITER - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ * RT_NUM_ENTRIES - Get the number of key-value pairs
+ *
+ * RT_CREATE() creates an empty radix tree in the given memory context
+ * and memory contexts for all kinds of radix tree node under the memory context.
+ *
+ * RT_ITERATE_NEXT() ensures returning key-value pairs in the ascending
+ * order of the key.
+ *
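+ * For example, a hypothetical instantiation for backend-local use might look
+ * like this (the prefix "foo" is just an illustration):
+ *
+ *     #define RT_PREFIX foo
+ *     #define RT_SCOPE static
+ *     #define RT_DECLARE
+ *     #define RT_DEFINE
+ *     #include "lib/radixtree.h"
+ *
+ * which generates foo_radix_tree, foo_create(), foo_set(), foo_search(), and
+ * so on.
+ *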
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#define RT_DELETE RT_MAKE_NAME(delete)
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#define RT_NUM_ENTRIES RT_MAKE_NAME(num_entries)
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
+#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
+#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+RT_SCOPE uint64 RT_NUM_ENTRIES(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* macros and types common to all implementations */
+#ifndef RT_COMMON
+#define RT_COMMON
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds, and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind share the same
+ * node structure but have a different fanout, which is stored
+ * in 'fanout' of RT_NODE. For example in size class 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully to minimize the allocator
+ * padding in both the inner and leaf nodes on DSA.
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+#endif /* RT_COMMON */
+
+
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
+
+/* Base type of each node kind, for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+   class for variable-sized node kinds. */
+typedef struct RT_NODE_BASE_4
+{
+ RT_NODE n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} RT_NODE_BASE_4;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(128)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct RT_NODE_INNER_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_4;
+
+typedef struct RT_NODE_LEAF_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_4;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} RT_SIZE_CLASS_ELEM;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
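+
+/*
+ * For example, assuming SLAB_DEFAULT_BLOCK_SIZE is 8kB: a hypothetical
+ * 48-byte node gets a block size of (8192 / 48) * 48 = 8160 bytes (170
+ * chunks per block), whereas a hypothetical 2088-byte node would get
+ * 2088 * 32 = 66816 bytes, since fewer than 32 of them fit in the default
+ * block size.
+ */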
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+/* Map from the node kind to its minimum size class */
+static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Tree-wide control data; lives in DSA when RT_SHMEM is defined */
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Backend-local handle for a radix tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating over the radix tree returns each pair of key and value in
+ * ascending order of the key.  To support this, we iterate over the nodes of
+ * each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses
+ * RT_NODE_ITER in order to track the iteration at each level.  During the
+ * iteration, we also construct the key whenever updating the node iteration
+ * information, e.g., when advancing the current index within the node or
+ * when moving to the next node at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration.  Therefore,
+ * RT_NODE_ITER has local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning an
+ * iteration while one is in progress, or to allow multiple processes to
+ * iterate concurrently.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return the index of the first chunk in 'node' that equals 'chunk'.  Return
+ * -1 if there is no such element.
+ */
+static inline int
+RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first chunk in 'node' that equals 'chunk'.  Return
+ * -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
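+	/*
+	 * The SIMD path below finds the insert position without a loop: taking
+	 * the element-wise minimum of the broadcast chunk and the stored chunks
+	 * yields the broadcast chunk exactly at positions where the stored chunk
+	 * is >= it, so the lowest set bit of that comparison mask (restricted to
+	 * the first 'count' positions) is the insertion point; if no bit is set,
+	 * the new chunk goes at the end.
+	 */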
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+	memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have an entry (value or child)? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child or value at the given chunk position in node-256 */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the maximum key value that can be stored in a tree whose root node
+ * has the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
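+
+/*
+ * For example, assuming RT_NODE_SPAN is 8: RT_KEY_GET_SHIFT(0x10000) is
+ * (16 / 8) * 8 = 16, and RT_SHIFT_GET_MAX_VAL(16) is (1 << 24) - 1, i.e.
+ * any key that fits in three chunks fits under such a root.
+ */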
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (inner)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (inner)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool inner = shift > 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+#if 0
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static RT_NODE*
+RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+#endif
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+ RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
+
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old->shift == new->shift);
+#endif
+
+ if (parent == old)
+ {
+ /* Replace the root node with the new large node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it
+ * can store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_4 *n4;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->shift = shift;
+ node->count = 1;
+
+ n4 = (RT_NODE_INNER_4 *) node;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have the inner and leaf nodes for the given
+ * key-value pair. Create the missing nodes from 'node' down to the bottom,
+ * then insert the value into the leaf.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ nodep = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is stored in *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Delete the child pointer corresponding to 'key' from the given inner node.
+ *
+ * Return true if the key is found and the entry is deleted, otherwise return
+ * false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Delete the value corresponding to 'key' from the given leaf node.
+ *
+ * Return true if the key is found and the entry is deleted, otherwise return
+ * false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/* Insert the child to the inner node */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Insert the value to the leaf node */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+	/* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* XXX: do we need to set a callback on exit to detach dsa? */
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+	 * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle); // XXX
+ //dsa_detach(tree->dsa);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to
+ * 'value' and return true. Return false if the entry doesn't yet exist.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC nodep;
+ RT_PTR_LOCAL node;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ //Assert(tree->ctl->root);
+
+ nodep = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, nodep);
+ shift = parent->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ node = RT_PTR_GET_LOCAL(tree, nodep);
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_SET_EXTEND(tree, key, value, parent, nodep, node);
+ return false;
+ }
+
+ parent = node;
+ nodep = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is
+ * present, otherwise return false. On success, the value is stored in
+ * *value_p, which therefore must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
+}
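+
+/*
+ * A minimal local-memory usage sketch (the names below are the template
+ * macros used in this file; each expands to a function name carrying the
+ * caller-supplied RT_PREFIX):
+ *
+ *	RT_RADIX_TREE *tree = RT_CREATE(CurrentMemoryContext);
+ *	uint64	val;
+ *
+ *	RT_SET(tree, key, val);
+ *	if (RT_SEARCH(tree, key, &val))
+ *		... use val ...
+ *	RT_FREE(tree);
+ */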
+
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+		/* the key was not found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+	/* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ return true;
+}
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
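+
+/*
+ * For example, with RT_NODE_SPAN of 8, advancing to chunk 0xAB at shift 8
+ * clears bits 8..15 of iter->key and sets them to 0xAB; the full key is thus
+ * assembled chunk by chunk as the iteration descends and advances.
+ */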
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to value_p, otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend from the root to the leftmost leaf node. The key is constructed
+	 * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key. Otherwise,
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner
+		 * node iterators from level 1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+RT_SCOPE uint64
+RT_NUM_ENTRIES(RT_RADIX_TREE *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ // XXX is this necessary?
+ Size total = sizeof(RT_RADIX_TREE);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+				/* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+void
+rt_stats(RT_RADIX_TREE *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->ctl->num_keys,
+ tree->ctl->root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
+}
+
+static void
+rt_dump_node(RT_PTR_LOCAL node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ rt_dump_node(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ rt_dump_node(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(128); i++)
+ {
+ fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ }
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ rt_dump_node(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+void
+rt_dump_search(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->ctl->max_val, tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_LOCAL child;
+
+ rt_dump_node(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node, find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+void
+rt_dump(RT_RADIX_TREE *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_size,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ rt_dump_node(tree->ctl->root, 0, true);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+
+/* locally declared macros */
+#undef NODE_IS_LEAF
+#undef NODE_IS_EMPTY
+#undef VAR_NODE_HAS_FREE_SLOT
+#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_RADIX_TREE_MAGIC
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_4_FULL
+#undef RT_CLASS_32_PARTIAL
+#undef RT_CLASS_32_FULL
+#undef RT_CLASS_125_FULL
+#undef RT_CLASS_256
+#undef RT_KIND_MIN_SIZE_CLASS
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_NUM_ENTRIES
+#undef RT_DUMP
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_GROW_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..eb87866b90
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,106 @@
+/* TODO: shrink nodes */
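+
+/*
+ * This file (like radixtree_insert_impl.h, radixtree_search_impl.h, and
+ * radixtree_iter_impl.h) is not a standalone header: it is #include'd into
+ * the body of a function in radixtree.h with either RT_NODE_LEVEL_INNER or
+ * RT_NODE_LEVEL_LEAF defined, and expands to the per-node-kind handling for
+ * that level.
+ */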
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..e4faf54d9d
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,316 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ RT_PTR_LOCAL newnode = NULL;
+ RT_PTR_ALLOC allocnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool inner = false;
+ Assert(NODE_IS_LEAF(node));
+#else
+ const bool inner = true;
+ Assert(!NODE_IS_LEAF(node));
+#endif
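+
+	/*
+	 * When the current node is full, each case below allocates a node of the
+	 * next larger kind (4 -> 32 -> 125 -> 256; node-32 first grows from its
+	 * partial to its full size class), copies the existing entries into it,
+	 * replaces the old node in its parent, and then falls through to the
+	 * next case label to insert into the new node.
+	 */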
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[idx] = value;
+#else
+ n4->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 4 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ new32 = (RT_NODE32_TYPE *) newnode;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+#endif
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
+ count, insertpos);
+#endif
+ }
+
+ n4->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[insertpos] = value;
+#else
+ n4->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.fanout == class32_min.fanout)
+ {
+ /* grow to the next size class of this kind */
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (RT_NODE32_TYPE *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_NODE_125_INVALID_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ new256 = (RT_NODE256_TYPE *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(128); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+	 * Done. Finally, verify that the chunk and value have been inserted or
+	 * replaced properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..0b8b68df6c
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,138 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value;
+
+ Assert(NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
+#endif
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..31e4978e4f
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,131 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value = 0;
+
+ Assert(NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+#endif
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n4->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[idx];
+#else
+ child = n4->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_NODE_125_INVALID_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ *value_p = value;
+#else
+ Assert(child_p != NULL);
+ *child_p = child;
+#endif
+
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 104386e674..c67f936880 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/backend/lib/radixtree.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..f96bf159d6
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,34 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/backend/lib/radixtree.c',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..61d842789d
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,631 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.31.1
Attachment: v20-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From 055a13ace935bd5c6ca421437efb371a25e79b8f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v20 01/13] introduce vector8_min and vector8_highbit_mask
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..84d41a340a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
Attachment: v20-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 3e97873aee57c929e38cc38c35205de3e3fb8525 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v20 02/13] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 23bafec5f7..5bd3da4948 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
In v21, all of your v20 improvements to the radix tree template and test
have been squashed into 0003, with one exception: v20-0010 (recursive
freeing of shared mem), which I've attached separately (for flexibility) as
v21-0006. I believe one of your earlier patches had a new DSA function for
freeing memory more quickly -- was there a problem with that approach? I
don't recall where that discussion went.
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, therefore there are duplication codes. While this sometimes makes the
+ * code maintenance tricky, this reduces branch prediction misses when judging
+ * whether the node is a inner node of a leaf node.
This comment seems to be out-of-date since we made it a template.
Done in 0020, along with a bunch of other comment editing.
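
(Aside for readers following the archive: the inner/leaf duplication referred
to above is produced by compiling small shared fragments twice, once per node
level. A rough sketch of the pattern as it appears inside radixtree.h is
below; the function names are illustrative and the RT_* symbols come from the
template, so treat this as a sketch rather than standalone-compilable code.)

/* Sketch of the per-level fragment pattern (illustrative names only). */
static bool
sketch_node_search_leaf(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_search_impl.h"
#undef RT_NODE_LEVEL_LEAF
}

static bool
sketch_node_search_inner(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
{
#define RT_NODE_LEVEL_INNER
#include "lib/radixtree_search_impl.h"
#undef RT_NODE_LEVEL_INNER
}
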
The following macros are defined but not undefined in radixtree.h:
Fixed in v21-0018.
Also:
0007 makes the value type configurable. Some debug functionality still
assumes integer type, but I think the rest is agnostic.
0010 turns node4 into node3, as discussed, going from 48 bytes to 32.
0012 adopts the benchmark module to the template, and adds meson support
(builds with warnings, but okay because not meant for commit).
The rest are cleanups, small refactorings, and more comment rewrites. I've
kept them separate for visibility. Next patch can squash them unless there
is any discussion.
uint32 is how we store the block number, so this is too small and will
wrap around on overflow. int64 seems better.
Agreed, will fix.
Great, but it's now uint64, not int64. All the large counters in struct
LVRelState, for example, are signed integers, as the usual practice.
Unsigned ints are "usually" for things like bit patterns and where explicit
wraparound is desired. There's probably more that can be done here to
change to signed types, but I think it's still a bit early to get to that
level of nitpicking. (Soon, I hope :-) )
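
(A standalone illustration, not from the patch, of why a 32-bit tally is the
wrong type once a count can pass 2^32 -- the value silently wraps:)

/* Illustration only; not patch code. */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	int64_t		ntids = INT64_C(5000000000);	/* e.g. ~5 billion dead TIDs */
	uint32_t	narrow = (uint32_t) ntids;		/* wraps to 705032704 */

	printf("int64 counter : %lld\n", (long long) ntids);
	printf("uint32 counter: %u (wrapped)\n", (unsigned) narrow);
	return 0;
}
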
+ * We calculate the maximum bytes for the TidStore in different ways
+ * for non-shared case and shared case. Please refer to the comment
+ * TIDSTORE_MEMORY_DEDUCT for details.
+ */
Maybe the #define and comment should be close to here.
Will fix.
For this, I intended that "here" meant "in or just above the function".
+#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6
These symbols are used only once, in tidstore_create(), and are difficult
to read. That function has few comments. The symbols have several
paragraphs, but they are far away. It might be better for readability to
just hard-code numbers in the function, with the explanation about the
numbers near where they are used.
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backend must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
If not addressed by next patch, need to phrase comment with FIXME or
TODO about making certain.
Will fix.
Did anything change here? There is also this, in the template, which I'm
not sure has been addressed:
* XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
* has the local pointers to nodes, rather than RT_PTR_ALLOC.
* We need either a safeguard to disallow other processes to begin the iteration
* while one process is doing or to allow multiple processes to do the iteration.
This part only runs "if (vacrel->nindexes == 0)", so seems like
unneeded complexity. It arises because lazy_scan_prune() populates the tid
store even if no index vacuuming happens. Perhaps the caller of
lazy_scan_prune() could pass the deadoffsets array, and upon returning,
either populate the store or call lazy_vacuum_heap_page(), as needed. It's
quite possible I'm missing some detail, so some description of the design
choices made would be helpful.
I agree that we don't need complexity here. I'll try this idea.
Keeping the offsets array in the prunestate seems to work out well.
Some other quick comments on tid store and vacuum, not comprehensive. Let
me know if I've misunderstood something:
TID store:
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
I was confused for a while, and I realized the bits are in reverse order
from how they are usually pictured (high on left, low on the right).
+ * 11 bits enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * XXX: if we want to support non-heap table AM that want to use the full
+ * range of possible offset numbers, we'll need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.
Would it be worth it (or possible) to calculate constants based on
compile-time block size? And/or have a fallback for other table AMs? Since
this file is in access/common, the intention is to allow general-purpose, I
imagine.
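
(To make the quoted layout concrete, here is a hypothetical sketch of packing
a block/offset pair into and out of a 64-bit key with an 11-bit offset field.
The helper names are invented for illustration and are not the patch's API.)

/* Hypothetical sketch only; names are not from the patch. */
#include "postgres.h"
#include "storage/block.h"
#include "storage/off.h"

#define SKETCH_OFFSET_NBITS	11
#define SKETCH_OFFSET_MASK	((UINT64CONST(1) << SKETCH_OFFSET_NBITS) - 1)

static inline uint64
sketch_encode_tid_key(BlockNumber block, OffsetNumber offset)
{
	Assert(offset <= SKETCH_OFFSET_MASK);
	return ((uint64) block << SKETCH_OFFSET_NBITS) | (uint64) offset;
}

static inline void
sketch_decode_tid_key(uint64 key, BlockNumber *block, OffsetNumber *offset)
{
	*offset = (OffsetNumber) (key & SKETCH_OFFSET_MASK);
	*block = (BlockNumber) (key >> SKETCH_OFFSET_NBITS);
}
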
+typedef dsa_pointer tidstore_handle;
It's not clear why we need a typedef here, since here:
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
...
+ control = handle;
...there is a differently-named dsa_pointer variable that just gets the
function parameter.
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)
size_t is more suitable for memory.
+ /*
+ * Since the shared radix tree supports concurrent insert,
+ * we don't need to acquire the lock.
+ */
Hmm? IIUC, the caller only acquires the lock after returning from here, to
update statistics. Why is it safe to insert with no lock? Am I missing
something?
VACUUM integration:
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
Seems like unnecessary churn? It is still all about dead items, after all.
I understand using "DSA" for the LWLock, since that matches surrounding
code.
+#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)
This macro helps the patch readability in some places, but I'm not sure it
helps readability of the file as a whole. The following is in the patch and
seems perfectly clear without the macro:
- if (lpdead_items > 0)
+ if (prunestate->lpdead_items > 0)
About shared memory: I have some mild reservations about the naming of the
"control object", which may be in shared memory. Is that an established
term? (If so, disregard the rest): It seems backwards -- the thing in
shared memory is the actual tree itself. The thing in backend-local memory
has the "handle", and that's how we control the tree. I don't have a better
naming scheme, though, and might not be that important. (Added a WIP
comment)
Now might be a good time to look at earlier XXX comments and come up with a
plan to address them.
That's all I have for now.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
Attachment: v21-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 2bd133c432a960f79ec58edbf0fe0767620d81c0 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v21 02/22] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 24510ac29e..758e20f148 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3660,7 +3660,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.0
Attachment: v21-0001-introduce-vector8_min-and-vector8_highbit_mask.patch
From b0edeae77488d98733752a9190d1af36838b645f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v21 01/22] introduce vector8_min and vector8_highbit_mask
TODO: commit message
TODO: Remove uint64 case.
separate-commit TODO: move non-SIMD fallbacks to own header
to clean up the #ifdef maze.
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..84d41a340a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.0
Attachment: v21-0005-Restore-RT_GROW_NODE_KIND.patch
From 7af8716587b466a298052c8185cf51ce38399686 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 11:32:24 +0700
Subject: [PATCH v21 05/22] Restore RT_GROW_NODE_KIND
(This was previously "exploded" out during the work to
switch this to a template)
Change the API so that we pass it the allocated pointer
and return the local pointer. That way, there is consistency
in growing nodes whether we change kind or not.
Also rename to RT_SWITCH_NODE_KIND, since it should work just as
well for shrinking nodes.
---
src/include/lib/radixtree.h | 104 +++---------------------
src/include/lib/radixtree_insert_impl.h | 24 ++----
2 files changed, 19 insertions(+), 109 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index a1458bc25f..c08016de3a 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -127,10 +127,9 @@
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
-#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
-//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
@@ -1080,26 +1079,22 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
newnode->shift = oldnode->shift;
newnode->count = oldnode->count;
}
-#if 0
+
/*
- * Create a new node with 'new_kind' and the same shift, chunk, and
- * count of 'node'.
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
*/
-static RT_NODE*
-RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool inner)
{
- RT_PTR_ALLOC allocnode;
- RT_PTR_LOCAL newnode;
- bool inner = !NODE_IS_LEAF(node);
-
- allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
RT_COPY_NODE(newnode, node);
return newnode;
}
-#endif
+
/* Free the given node */
static void
RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
@@ -1415,78 +1410,6 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
return tree->ctl->handle;
}
-
-/*
- * Recursively free all nodes allocated to the DSA area.
- */
-static inline void
-RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
-{
- RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
-
- check_stack_depth();
- CHECK_FOR_INTERRUPTS();
-
- /* The leaf node doesn't have child pointers */
- if (NODE_IS_LEAF(node))
- {
- dsa_free(tree->dsa, ptr);
- return;
- }
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
-
- for (int i = 0; i < n4->base.n.count; i++)
- RT_FREE_RECURSE(tree, n4->children[i]);
-
- break;
- }
- case RT_NODE_KIND_32:
- {
- RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
-
- for (int i = 0; i < n32->base.n.count; i++)
- RT_FREE_RECURSE(tree, n32->children[i]);
-
- break;
- }
- case RT_NODE_KIND_125:
- {
- RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
-
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
- continue;
-
- RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
- }
-
- break;
- }
- case RT_NODE_KIND_256:
- {
- RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
-
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
- continue;
-
- RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
- }
-
- break;
- }
- }
-
- /* Free the inner node */
- dsa_free(tree->dsa, ptr);
-}
#endif
/*
@@ -1498,10 +1421,6 @@ RT_FREE(RT_RADIX_TREE *tree)
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
- /* Free all memory used for radix tree nodes */
- if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
- RT_FREE_RECURSE(tree, tree->ctl->root);
-
/*
* Vandalize the control block to help catch programming error where
* other backends access the memory formerly occupied by this radix tree.
@@ -2280,10 +2199,9 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
#undef RT_FREE_NODE
-#undef RT_FREE_RECURSE
#undef RT_EXTEND
#undef RT_SET_EXTEND
-#undef RT_GROW_NODE_KIND
+#undef RT_SWITCH_NODE_KIND
#undef RT_COPY_NODE
#undef RT_REPLACE_NODE
#undef RT_PTR_GET_LOCAL
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 1d0eb396e2..e3e44669ea 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -53,11 +53,9 @@
/* grow node from 4 to 32 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
- RT_COPY_NODE(newnode, node);
- //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new32 = (RT_NODE32_TYPE *) newnode;
+
#ifdef RT_NODE_LEVEL_LEAF
RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
new32->base.chunks, new32->values);
@@ -119,13 +117,15 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
n32->base.n.fanout == class32_min.fanout)
{
- /* grow to the next size class of this kind */
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+ /* grow to the next size class of this kind */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
#ifdef RT_NODE_LEVEL_LEAF
memcpy(newnode, node, class32_min.leaf_size);
#else
@@ -135,9 +135,6 @@
RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
-
- /* also update pointer for this kind */
- n32 = (RT_NODE32_TYPE *) newnode;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
@@ -152,10 +149,7 @@
/* grow node from 32 to 125 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
- RT_COPY_NODE(newnode, node);
- //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new125 = (RT_NODE125_TYPE *) newnode;
for (int i = 0; i < class32_max.fanout; i++)
@@ -229,11 +223,9 @@
/* grow node from 125 to 256 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
- RT_COPY_NODE(newnode, node);
- //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new256 = (RT_NODE256_TYPE *) newnode;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
--
2.39.0
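For reference, after this patch every growth path in radixtree_insert_impl.h reduces to the same shape. The fragment below is condensed from the patch for illustration only and is not compilable on its own:

    /* grow 'node' into the next kind (condensed from the patch) */
    allocnode = RT_ALLOC_NODE(tree, new_class, inner);
    newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
    /* ... copy the kind-specific chunks/children (or values) arrays ... */
    RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
    node = newnode;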
Attachment: v21-0004-Clean-up-some-nomenclature-around-node-insertion.patch (text/x-patch, US-ASCII)
From d9f4b6280f73076df05c1fd03ca6860df3b90c74 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 19 Jan 2023 16:33:51 +0700
Subject: [PATCH v21 04/22] Clean up some nomenclature around node insertion
Replace node/nodep with hopefully more informative names.
In passing, remove some outdated asserts and move some
variable declarations to the scope where they're used.
---
src/include/lib/radixtree.h | 64 ++++++++++++++-----------
src/include/lib/radixtree_insert_impl.h | 22 +++++----
2 files changed, 47 insertions(+), 39 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 97cccdc9ca..a1458bc25f 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -645,9 +645,9 @@ typedef struct RT_ITER
} RT_ITER;
-static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child);
-static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, uint64 value);
/* verification (available only with assertion) */
@@ -1153,18 +1153,18 @@ RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
* Replace old_child with new_child, and free the old one.
*/
static void
-RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
RT_PTR_ALLOC new_child, uint64 key)
{
- RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
-
#ifdef USE_ASSERT_CHECKING
RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
- Assert(old->shift == new->shift);
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
#endif
- if (parent == old)
+ if (parent == old_child)
{
/* Replace the root node with the new large node */
tree->ctl->root = new_child;
@@ -1172,7 +1172,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child
else
RT_NODE_UPDATE_INNER(parent, key, new_child);
- RT_FREE_NODE(tree, old_child);
+ RT_FREE_NODE(tree, stored_old_child);
}
/*
@@ -1220,11 +1220,11 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
*/
static inline void
RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
- RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
int shift = node->shift;
- Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
while (shift >= RT_NODE_SPAN)
{
@@ -1237,15 +1237,15 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newchild->shift = newshift;
- RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
parent = node;
node = newchild;
- nodep = allocchild;
+ stored_node = allocchild;
shift -= RT_NODE_SPAN;
}
- RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value);
tree->ctl->num_keys++;
}
@@ -1305,9 +1305,15 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
}
#endif
-/* Insert the child to the inner node */
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
static bool
-RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child)
{
#define RT_NODE_LEVEL_INNER
@@ -1315,9 +1321,9 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC node
#undef RT_NODE_LEVEL_INNER
}
-/* Insert the value to the leaf node */
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
static bool
-RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, uint64 value)
{
#define RT_NODE_LEVEL_LEAF
@@ -1525,8 +1531,8 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
int shift;
bool updated;
RT_PTR_LOCAL parent;
- RT_PTR_ALLOC nodep;
- RT_PTR_LOCAL node;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
@@ -1540,32 +1546,32 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
if (key > tree->ctl->max_val)
RT_EXTEND(tree, key);
- nodep = tree->ctl->root;
- parent = RT_PTR_GET_LOCAL(tree, nodep);
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
shift = parent->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- RT_PTR_ALLOC child;
+ RT_PTR_ALLOC new_child;
- node = RT_PTR_GET_LOCAL(tree, nodep);
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
- if (NODE_IS_LEAF(node))
+ if (NODE_IS_LEAF(child))
break;
- if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
{
- RT_SET_EXTEND(tree, key, value, parent, nodep, node);
+ RT_SET_EXTEND(tree, key, value, parent, stored_child, child);
return false;
}
- parent = node;
- nodep = child;
+ parent = child;
+ stored_child = new_child;
shift -= RT_NODE_SPAN;
}
- updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value);
/* Update the statistics */
if (!updated)
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index e4faf54d9d..1d0eb396e2 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -14,8 +14,6 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool chunk_exists = false;
- RT_PTR_LOCAL newnode = NULL;
- RT_PTR_ALLOC allocnode;
#ifdef RT_NODE_LEVEL_LEAF
const bool inner = false;
@@ -47,6 +45,8 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
RT_NODE32_TYPE *new32;
const uint8 new_kind = RT_NODE_KIND_32;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
@@ -65,8 +65,7 @@
RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
#endif
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
}
else
@@ -121,6 +120,8 @@
n32->base.n.fanout == class32_min.fanout)
{
/* grow to the next size class of this kind */
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
@@ -132,8 +133,7 @@
#endif
newnode->fanout = class32_max.fanout;
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
/* also update pointer for this kind */
@@ -142,6 +142,8 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
RT_NODE125_TYPE *new125;
const uint8 new_kind = RT_NODE_KIND_125;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
@@ -169,8 +171,7 @@
Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
}
else
@@ -220,6 +221,8 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
RT_NODE256_TYPE *new256;
const uint8 new_kind = RT_NODE_KIND_256;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
@@ -243,8 +246,7 @@
cnt++;
}
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
}
else
--
2.39.0
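For context on the template patch below: instantiating it for local memory would look roughly like the following. This is only a sketch based on the parameter list in radixtree.h's header comment; the prefix "rt" and the calling code are made up for illustration (the test module in the patch does something similar).

    /* instantiate a local-memory radix tree with prefix "rt" */
    #define RT_PREFIX rt
    #define RT_SCOPE static
    #define RT_DECLARE
    #define RT_DEFINE
    #define RT_USE_DELETE
    #include "lib/radixtree.h"

    static void
    example(void)
    {
        rt_radix_tree *tree = rt_create(CurrentMemoryContext);
        uint64      key = 123;
        uint64      value;

        rt_set(tree, key, UINT64CONST(42));
        if (rt_search(tree, key, &value))
            elog(NOTICE, "found " UINT64_FORMAT, value);
        rt_delete(tree, key);
        rt_free(tree);
    }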
Attachment: v21-0003-Add-radixtree-template.patch (text/x-patch, US-ASCII)
From 2035dde63943dc5461a69fc7aa1f510e68f1cd64 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v21 03/22] Add radixtree template
The only things configurable in this commit are function scope,
prefix, and local/shared memory.
The key and value types are still hard-coded to uint64.
(A later commit in v21 will make value type configurable)
It might be good at some point to offer a different tree type,
e.g. "single-value leaves" to allow for variable length keys
and values, giving full flexibility to developers.
TODO: Much broader commit message
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2321 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 106 +
src/include/lib/radixtree_insert_impl.h | 316 +++
src/include/lib/radixtree_iter_impl.h | 138 +
src/include/lib/radixtree_search_impl.h | 131 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 653 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 3816 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 604b702a91..50f0aae3ab 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..97cccdc9ca
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2321 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Implementation of an adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression and lazy path expansion. The radix
+ * tree supports a fixed length of the key, so the tree is not expected to
+ * become very deep.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes (shift > 0)
+ * store pointers to their child nodes as values, whereas leaf nodes
+ * (shift == 0) store the 64-bit unsigned integer specified by the user as the
+ * value. The paper refers to this technique as "Multi-value leaves". We choose
+ * it to avoid an additional pointer traversal. It is the reason this code
+ * currently does not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is some code duplication. While this sometimes makes code
+ * maintenance tricky, it reduces branch prediction misses when judging
+ * whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined, function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined, function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ *
+ * Optional parameters:
+ * - RT_DEBUG - if defined, add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * RT_CREATE() creates an empty radix tree in the given memory context
+ * and creates memory contexts for each kind of radix tree node under that context.
+ *
+ * RT_ITERATE_NEXT() returns key-value pairs in the ascending
+ * order of the key.
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined only if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
+#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
+#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* macros and types common to all implementations */
+#ifndef RT_COMMON
+#define RT_COMMON
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Maximum number of levels the radix tree can have */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind have the same
+ * node structure but a different fanout, which is stored in the 'fanout'
+ * field of RT_NODE. For example in size class 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ *
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+#endif /* RT_COMMON */
+
+
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Common type for all node types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
+
+/* Base type of each node kind for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds */
+typedef struct RT_NODE_BASE_4
+{
+ RT_NODE n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} RT_NODE_BASE_4;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses the slot_idxs array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+	/* For each chunk, the index of its slot in the values/children array */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(128)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct RT_NODE_INNER_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_4;
+
+typedef struct RT_NODE_LEAF_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_4;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} RT_SIZE_CLASS_ELEM;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+/* Map from the node kind to its minimum size class */
+static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Control data for a radix tree */
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* A radix tree handle; the control data may live in local memory or DSA */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning the iteration
+ * while one process is doing it, or to allow multiple processes to iterate.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in 'base' that equals 'key'. Return -1
+ * if there is no such element.
+ */
+static inline int
+RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunks array of the given node.
+ */
+static inline int
+RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in 'node' that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index at which 'chunk' should be inserted into the chunks array of the given node.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the shift that is sufficient to store the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in a node with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (inner)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (inner)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool inner = shift > 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+#if 0
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static RT_NODE*
+RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+#endif
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+ RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
+
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old->shift == new->shift);
+#endif
+
+ if (parent == old)
+ {
+ /* Replace the root node with the new large node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_4 *n4;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->shift = shift;
+ node->count = 1;
+
+ n4 = (RT_NODE_INNER_4 *) node;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have the inner and leaf nodes needed for the given
+ * key-value pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ nodep = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the value
+ * is stored in *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the child and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the value and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/* Insert the child to the inner node */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Insert the value to the leaf node */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ for (int i = 0; i < n4->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true. Return false if the entry doesn't yet exist.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC nodep;
+ RT_PTR_LOCAL node;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ nodep = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, nodep);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ node = RT_PTR_GET_LOCAL(tree, nodep);
+
+ if (NODE_IS_LEAF(node))
+ break;
+
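+ /* No child for this chunk yet: create the missing path and insert the value */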
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_SET_EXTEND(tree, key, value, parent, nodep, node);
+ return false;
+ }
+
+ parent = node;
+ nodep = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key exists,
+ * otherwise return false. On success, the value is set to *value_p, which
+ * therefore must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search for the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the child pointer from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ return true;
+}
+#endif
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value to *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
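+ /*
+ * Walk down from 'from_node', resetting each stack entry and advancing
+ * the inner node iterators to their first child.
+ */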
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is constructed
+ * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance the inner node
+ * iterators from level 1 upward until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the found node in the node iterator and update the iterator stack
+ * from this node downward.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Return the statistics of the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ // XXX is this necessary?
+ Size total = sizeof(RT_RADIX_TREE);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check that the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n32_partial = %u, n32_full = %u, n125 = %u, n256 = %u",
+ tree->ctl->num_keys,
+ tree->ctl->root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
+}
+
+static void
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ RT_DUMP_NODE(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ RT_DUMP_NODE(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(128); i++)
+ {
+ fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ }
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ RT_DUMP_NODE(RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ RT_DUMP_NODE(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->ctl->max_val, tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_LOCAL child;
+
+ RT_DUMP_NODE(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_size,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ RT_DUMP_NODE(tree->ctl->root, 0, true);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+
+/* locally declared macros */
+#undef NODE_IS_LEAF
+#undef NODE_IS_EMPTY
+#undef VAR_NODE_HAS_FREE_SLOT
+#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_RADIX_TREE_MAGIC
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_4_FULL
+#undef RT_CLASS_32_PARTIAL
+#undef RT_CLASS_32_FULL
+#undef RT_CLASS_125_FULL
+#undef RT_CLASS_256
+#undef RT_KIND_MIN_SIZE_CLASS
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_GROW_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..eb87866b90
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,106 @@
+/* TODO: shrink nodes */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
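+
+ /* Mark the slot free and invalidate the chunk's slot index */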
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..e4faf54d9d
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,316 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ RT_PTR_LOCAL newnode = NULL;
+ RT_PTR_ALLOC allocnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool inner = false;
+ Assert(NODE_IS_LEAF(node));
+#else
+ const bool inner = true;
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[idx] = value;
+#else
+ n4->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 4 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ new32 = (RT_NODE32_TYPE *) newnode;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+#endif
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
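+
+ /* fall through to the kind 32 case below to insert the new chunk into the grown node */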
+ }
+ else
+ {
+ int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
+ count, insertpos);
+#endif
+ }
+
+ n4->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[insertpos] = value;
+#else
+ n4->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.fanout == class32_min.fanout)
+ {
+ /* grow to the next size class of this kind */
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (RT_NODE32_TYPE *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
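+
+ /* fall through to the kind 125 case below to insert the new chunk into the grown node */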
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_NODE_125_INVALID_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ new256 = (RT_NODE256_TYPE *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
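+
+ /* fall through to the kind 256 case below to insert the new chunk into the grown node */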
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(128); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..0b8b68df6c
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,138 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value;
+
+ Assert(NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
+#endif
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
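+ /* Find the next chunk that has an assigned slot */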
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..31e4978e4f
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,131 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
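+/*
+ * When RT_ACTION_UPDATE is defined, this template replaces the child pointer
+ * for the chunk instead of returning it to the caller.
+ */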
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value = 0;
+
+ Assert(NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+#endif
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n4->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[idx];
+#else
+ child = n4->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_NODE_125_INVALID_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ *value_p = value;
+#else
+ Assert(child_p != NULL);
+ *child_p = child;
+#endif
+
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 104386e674..c67f936880 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/include/lib/radixtree.h"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..d8323f587f
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,653 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in an interleaved order like 1, 32, 2, 31, 3, 30, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ uint64 value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check that keys from start to end, shifted by 'shift', exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test inserting and deleting key-value pairs into each node type at the
+ * given shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.0
Attachment: v21-0006-Free-all-radix-tree-nodes-recursively.patch
From 6dca6018bd9ffbb6f00e26b01dfc80377a910440 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 20 Jan 2023 12:38:54 +0700
Subject: [PATCH v21 06/22] Free all radix tree nodes recursively
TODO: Consider adding more general functionality to DSA
to free all segments.
---
src/include/lib/radixtree.h | 78 +++++++++++++++++++++++++++++++++++++
1 file changed, 78 insertions(+)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index c08016de3a..98e4597eac 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -127,6 +127,7 @@
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
@@ -1410,6 +1411,78 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
return tree->ctl->handle;
}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ for (int i = 0; i < n4->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
#endif
/*
@@ -1421,6 +1494,10 @@ RT_FREE(RT_RADIX_TREE *tree)
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
/*
* Vandalize the control block to help catch programming error where
* other backends access the memory formerly occupied by this radix tree.
@@ -2199,6 +2276,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
#undef RT_EXTEND
#undef RT_SET_EXTEND
#undef RT_SWITCH_NODE_KIND
--
2.39.0
Attachment: v21-0009-Remove-hard-coded-128.patch
From 5b4fff91055335d5dcc22ae6eee26168cf889486 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 15:51:21 +0700
Subject: [PATCH v21 09/22] Remove hard-coded 128
Also comment that 64 could be a valid number of bits
in the bitmap for this node type.
TODO: Consider whether we should in fact limit this
node to ~64.
In passing, remove "125" from invalid-slot-index macro.
---
src/include/lib/radixtree.h | 19 +++++++++++++------
src/include/lib/radixtree_delete_impl.h | 4 ++--
src/include/lib/radixtree_insert_impl.h | 4 ++--
src/include/lib/radixtree_search_impl.h | 4 ++--
4 files changed, 19 insertions(+), 12 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 172d62c6b0..d15ea8f0fe 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -270,8 +270,15 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
/* Tree level the radix tree uses */
#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
/* Invalid index used in node-125 */
-#define RT_NODE_125_INVALID_IDX 0xFF
+#define RT_INVALID_SLOT_IDX 0xFF
/* Get a chunk from the key */
#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
@@ -409,7 +416,7 @@ typedef struct RT_NODE_BASE_125
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
/* isset is a bitmap to track which slot is in use */
- bitmapword isset[BM_IDX(128)];
+ bitmapword isset[BM_IDX(RT_SLOT_IDX_LIMIT)];
} RT_NODE_BASE_125;
typedef struct RT_NODE_BASE_256
@@ -867,7 +874,7 @@ RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
static inline bool
RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
{
- return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
}
static inline RT_PTR_ALLOC
@@ -881,7 +888,7 @@ static inline RT_VALUE_TYPE
RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
- Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -1037,7 +1044,7 @@ RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner
{
RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
- memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
}
}
@@ -2052,7 +2059,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < BM_IDX(128); i++)
+ for (int i = 0; i < BM_IDX(RT_SLOT_IDX_LIMIT); i++)
{
fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
}
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index 2612730481..2f1c172672 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -65,13 +65,13 @@
int idx;
int bitnum;
- if (slotpos == RT_NODE_125_INVALID_IDX)
+ if (slotpos == RT_INVALID_SLOT_IDX)
return false;
idx = BM_IDX(slotpos);
bitnum = BM_BIT(slotpos);
n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
- n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
break;
}
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index e3e44669ea..90fe5f539e 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -201,7 +201,7 @@
int slotpos = n125->base.slot_idxs[chunk];
int cnt = 0;
- if (slotpos != RT_NODE_125_INVALID_IDX)
+ if (slotpos != RT_INVALID_SLOT_IDX)
{
/* found the existing chunk */
chunk_exists = true;
@@ -247,7 +247,7 @@
bitmapword inverse;
/* get the first word with at least one bit not set */
- for (idx = 0; idx < BM_IDX(128); idx++)
+ for (idx = 0; idx < BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
{
if (n125->base.isset[idx] < ~((bitmapword) 0))
break;
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 365abaa46d..d2bbdd2450 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -73,10 +73,10 @@
int slotpos = n125->base.slot_idxs[chunk];
#ifdef RT_ACTION_UPDATE
- Assert(slotpos != RT_NODE_125_INVALID_IDX);
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
n125->children[slotpos] = new_child;
#else
- if (slotpos == RT_NODE_125_INVALID_IDX)
+ if (slotpos == RT_INVALID_SLOT_IDX)
return false;
#ifdef RT_NODE_LEVEL_LEAF
--
2.39.0
Attachment: v21-0008-Streamline-calculation-of-slab-blocksize.patch
From e02aa8c8c9d36f2d45f91c462e3f55f5f39428e6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 14:55:25 +0700
Subject: [PATCH v21 08/22] Streamline calculation of slab blocksize
To reduce duplication. This will likely lead to
division instructions, but a few cycles won't
matter at all when creating the tree.
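To make the rounding behaviour concrete, here is a small stand-alone sketch of
what the macro computes (SLAB_DEFAULT_BLOCK_SIZE is assumed to be the usual
8kB, and Max() is redefined locally so the snippet builds outside the tree):

#include <stdio.h>

#define Max(a, b) ((a) > (b) ? (a) : (b))
#define SLAB_DEFAULT_BLOCK_SIZE (8 * 1024)	/* assumption: 8kB, as in memutils.h */
#define RT_SLAB_BLOCK_SIZE(size) \
	Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)

int
main(void)
{
	/* small chunk: default block size rounded down to a multiple of the chunk size */
	printf("%d\n", RT_SLAB_BLOCK_SIZE(40));		/* 8160 */

	/* large chunk: the at-least-32-chunks minimum wins */
	printf("%d\n", RT_SLAB_BLOCK_SIZE(2088));	/* 66816 */

	return 0;
}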
---
src/include/lib/radixtree.h | 50 ++++++++++++++-----------------------
1 file changed, 19 insertions(+), 31 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 0a39bd6664..172d62c6b0 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -304,6 +304,13 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#define RT_NODE_KIND_256 0x03
#define RT_NODE_KIND_COUNT 4
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
#endif /* RT_COMMON */
@@ -503,59 +510,38 @@ typedef struct RT_SIZE_CLASS_ELEM
/* slab chunk size */
Size inner_size;
Size leaf_size;
-
- /* slab block size */
- Size inner_blocksize;
- Size leaf_blocksize;
} RT_SIZE_CLASS_ELEM;
-/*
- * Calculate the slab blocksize so that we can allocate at least 32 chunks
- * from the block.
- */
-#define NODE_SLAB_BLOCK_SIZE(size) \
- Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
-
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
[RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
.inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
.fanout = 256,
.inner_size = sizeof(RT_NODE_INNER_256),
.leaf_size = sizeof(RT_NODE_LEAF_256),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
},
};
@@ -1361,14 +1347,18 @@ RT_CREATE(MemoryContext ctx)
/* Create the slab allocator for each size class */
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
tree->inner_slabs[i] = SlabContextCreate(ctx,
- RT_SIZE_CLASS_INFO[i].name,
- RT_SIZE_CLASS_INFO[i].inner_blocksize,
- RT_SIZE_CLASS_INFO[i].inner_size);
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
tree->leaf_slabs[i] = SlabContextCreate(ctx,
- RT_SIZE_CLASS_INFO[i].name,
- RT_SIZE_CLASS_INFO[i].leaf_blocksize,
- RT_SIZE_CLASS_INFO[i].leaf_size);
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
}
#endif
@@ -2189,12 +2179,10 @@ RT_DUMP(RT_RADIX_TREE *tree)
{
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\t%zu\n",
RT_SIZE_CLASS_INFO[i].name,
RT_SIZE_CLASS_INFO[i].inner_size,
- RT_SIZE_CLASS_INFO[i].inner_blocksize,
- RT_SIZE_CLASS_INFO[i].leaf_size,
- RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ RT_SIZE_CLASS_INFO[i].leaf_size);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
if (!tree->ctl->root)
--
2.39.0
Attachment: v21-0010-Reduce-node4-to-node3.patch
From 1226982cc3c3ac779953de4afb6e85f31be11a28 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 18:05:15 +0700
Subject: [PATCH v21 10/22] Reduce node4 to node3
Now that we don't store "chunk", the base node type is only
5 bytes in size. With 3 key chunks, there is no alignment
padding between the chunks array and the child/value array.
This reduces the smallest inner node to 32 bytes on 64-bit
platforms.
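Spelling out the layout arithmetic (taking the 5-byte header above at face
value, with 8-byte child pointers on a 64-bit platform):

    node3 inner: 5 (header) + 3 (chunks) = 8 bytes, then 3 * 8 = 24 bytes of
                 children -> 32 bytes total
    node4 inner: 5 (header) + 4 (chunks) = 9 bytes, padded to 16 before the
                 children, then 4 * 8 = 32 bytes -> 48 bytes total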
---
src/include/lib/radixtree.h | 124 ++++++++++++------------
src/include/lib/radixtree_delete_impl.h | 20 ++--
src/include/lib/radixtree_insert_impl.h | 38 ++++----
src/include/lib/radixtree_iter_impl.h | 18 ++--
src/include/lib/radixtree_search_impl.h | 18 ++--
5 files changed, 109 insertions(+), 109 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d15ea8f0fe..6cc8442c89 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -136,9 +136,9 @@
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
-#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
-#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
@@ -181,22 +181,22 @@
#endif
#define RT_NODE RT_MAKE_NAME(node)
#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
-#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
-#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
-#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
-#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_3_FULL RT_MAKE_NAME(class_3_full)
#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
@@ -305,7 +305,7 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
* allocator padding in both the inner and leaf nodes on DSA.
* node
*/
-#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_3 0x00
#define RT_NODE_KIND_32 0x01
#define RT_NODE_KIND_125 0x02
#define RT_NODE_KIND_256 0x03
@@ -323,7 +323,7 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
typedef enum RT_SIZE_CLASS
{
- RT_CLASS_4_FULL = 0,
+ RT_CLASS_3_FULL = 0,
RT_CLASS_32_PARTIAL,
RT_CLASS_32_FULL,
RT_CLASS_125_FULL,
@@ -387,13 +387,13 @@ typedef struct RT_NODE
/* Base type of each node kind for leaf and inner nodes */
/* The base types must be able to accommodate the largest size
class for variable-sized node kinds */
-typedef struct RT_NODE_BASE_4
+typedef struct RT_NODE_BASE_3
{
RT_NODE n;
- /* 4 children, for key chunks */
- uint8 chunks[4];
-} RT_NODE_BASE_4;
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
typedef struct RT_NODE_BASE_32
{
@@ -437,21 +437,21 @@ typedef struct RT_NODE_BASE_256
* good. It might be better to just indicate non-existing entries the same way
* in inner nodes.
*/
-typedef struct RT_NODE_INNER_4
+typedef struct RT_NODE_INNER_3
{
- RT_NODE_BASE_4 base;
+ RT_NODE_BASE_3 base;
/* number of children depends on size class */
RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
-} RT_NODE_INNER_4;
+} RT_NODE_INNER_3;
-typedef struct RT_NODE_LEAF_4
+typedef struct RT_NODE_LEAF_3
{
- RT_NODE_BASE_4 base;
+ RT_NODE_BASE_3 base;
/* number of values depends on size class */
RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
-} RT_NODE_LEAF_4;
+} RT_NODE_LEAF_3;
typedef struct RT_NODE_INNER_32
{
@@ -520,11 +520,11 @@ typedef struct RT_SIZE_CLASS_ELEM
} RT_SIZE_CLASS_ELEM;
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
- [RT_CLASS_4_FULL] = {
- .name = "radix tree node 4",
- .fanout = 4,
- .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE),
+ [RT_CLASS_3_FULL] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
@@ -556,7 +556,7 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
/* Map from the node kind to its minimum size class */
static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
- [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_3] = RT_CLASS_3_FULL,
[RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
[RT_NODE_KIND_125] = RT_CLASS_125_FULL,
[RT_NODE_KIND_256] = RT_CLASS_256,
@@ -673,7 +673,7 @@ RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
* if there is no such element.
*/
static inline int
-RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
{
int idx = -1;
@@ -693,7 +693,7 @@ RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
* Return index of the chunk to insert into chunks in the given node.
*/
static inline int
-RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
{
int idx;
@@ -810,7 +810,7 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
/*
* Functions to manipulate both chunks array and children/values array.
- * These are used for node-4 and node-32.
+ * These are used for node-3 and node-32.
*/
/* Shift the elements right at 'idx' by one */
@@ -848,7 +848,7 @@ static inline void
RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
@@ -860,7 +860,7 @@ static inline void
RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
@@ -1060,9 +1060,9 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1183,17 +1183,17 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL node;
- RT_NODE_INNER_4 *n4;
+ RT_NODE_INNER_3 *n3;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, true);
node = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3_FULL, true);
node->shift = shift;
node->count = 1;
- n4 = (RT_NODE_INNER_4 *) node;
- n4->base.chunks[0] = 0;
- n4->children[0] = tree->ctl->root;
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
/* Update the root */
tree->ctl->root = allocnode;
@@ -1223,9 +1223,9 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
@@ -1430,12 +1430,12 @@ RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
- for (int i = 0; i < n4->base.n.count; i++)
- RT_FREE_RECURSE(tree, n4->children[i]);
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
break;
}
@@ -1892,12 +1892,12 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
- for (int i = 1; i < n4->n.count; i++)
- Assert(n4->chunks[i - 1] < n4->chunks[i]);
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
break;
}
@@ -1959,10 +1959,10 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
RT_SCOPE void
RT_STATS(RT_RADIX_TREE *tree)
{
- ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
tree->ctl->num_keys,
tree->ctl->root->shift / RT_NODE_SPAN,
- tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_3_FULL],
tree->ctl->cnt[RT_CLASS_32_PARTIAL],
tree->ctl->cnt[RT_CLASS_32_FULL],
tree->ctl->cnt[RT_CLASS_125_FULL],
@@ -1977,7 +1977,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_3) ? 3 :
(node->kind == RT_NODE_KIND_32) ? 32 :
(node->kind == RT_NODE_KIND_125) ? 125 : 256,
node->fanout == 0 ? 256 : node->fanout,
@@ -1988,26 +1988,26 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
for (int i = 0; i < node->count; i++)
{
if (NODE_IS_LEAF(node))
{
- RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n4->base.chunks[i], (uint64) n4->values[i]);
+ space, n3->base.chunks[i], (uint64) n3->values[i]);
}
else
{
- RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
fprintf(stderr, "%schunk 0x%X ->",
- space, n4->base.chunks[i]);
+ space, n3->base.chunks[i]);
if (recurse)
- RT_DUMP_NODE(n4->children[i], level + 1, recurse);
+ RT_DUMP_NODE(n3->children[i], level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2229,22 +2229,22 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ITER
#undef RT_NODE
#undef RT_NODE_ITER
-#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_3
#undef RT_NODE_BASE_32
#undef RT_NODE_BASE_125
#undef RT_NODE_BASE_256
-#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_3
#undef RT_NODE_INNER_32
#undef RT_NODE_INNER_125
#undef RT_NODE_INNER_256
-#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_3
#undef RT_NODE_LEAF_32
#undef RT_NODE_LEAF_125
#undef RT_NODE_LEAF_256
#undef RT_SIZE_CLASS
#undef RT_SIZE_CLASS_ELEM
#undef RT_SIZE_CLASS_INFO
-#undef RT_CLASS_4_FULL
+#undef RT_CLASS_3_FULL
#undef RT_CLASS_32_PARTIAL
#undef RT_CLASS_32_FULL
#undef RT_CLASS_125_FULL
@@ -2282,9 +2282,9 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_REPLACE_NODE
#undef RT_PTR_GET_LOCAL
#undef RT_PTR_ALLOC_IS_VALID
-#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_3_SEARCH_EQ
#undef RT_NODE_32_SEARCH_EQ
-#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_3_GET_INSERTPOS
#undef RT_NODE_32_GET_INSERTPOS
#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
#undef RT_CHUNK_VALUES_ARRAY_SHIFT
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index 2f1c172672..b9f07f4eb5 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -1,12 +1,12 @@
/* TODO: shrink nodes */
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -24,20 +24,20 @@
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
- int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, n4->values,
- n4->base.n.count, idx);
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
#else
- RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
- n4->base.n.count, idx);
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
#endif
break;
}
@@ -100,7 +100,7 @@
return true;
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 90fe5f539e..16461bdb03 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -1,10 +1,10 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -25,25 +25,25 @@
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
int idx;
- idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
+ idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
if (idx != -1)
{
/* found the existing chunk */
chunk_exists = true;
#ifdef RT_NODE_LEVEL_LEAF
- n4->values[idx] = value;
+ n3->values[idx] = value;
#else
- n4->children[idx] = child;
+ n3->children[idx] = child;
#endif
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n3)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
@@ -51,16 +51,16 @@
const uint8 new_kind = RT_NODE_KIND_32;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
- /* grow node from 4 to 32 */
+ /* grow node from 3 to 32 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new32 = (RT_NODE32_TYPE *) newnode;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
new32->base.chunks, new32->values);
#else
- RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
new32->base.chunks, new32->children);
#endif
RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
@@ -68,27 +68,27 @@
}
else
{
- int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
- int count = n4->base.n.count;
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
/* shift chunks and children */
if (insertpos < count)
{
Assert(count > 0);
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
count, insertpos);
#else
- RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
count, insertpos);
#endif
}
- n4->base.chunks[insertpos] = chunk;
+ n3->base.chunks[insertpos] = chunk;
#ifdef RT_NODE_LEVEL_LEAF
- n4->values[insertpos] = value;
+ n3->values[insertpos] = value;
#else
- n4->children[insertpos] = child;
+ n3->children[insertpos] = child;
#endif
break;
}
@@ -304,7 +304,7 @@
return chunk_exists;
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 5c06f8b414..c428531438 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -1,10 +1,10 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -31,19 +31,19 @@
switch (node_iter->node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
node_iter->current_idx++;
- if (node_iter->current_idx >= n4->base.n.count)
+ if (node_iter->current_idx >= n3->base.n.count)
break;
#ifdef RT_NODE_LEVEL_LEAF
- value = n4->values[node_iter->current_idx];
+ value = n3->values[node_iter->current_idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
#endif
- key_chunk = n4->base.chunks[node_iter->current_idx];
+ key_chunk = n3->base.chunks[node_iter->current_idx];
found = true;
break;
}
@@ -132,7 +132,7 @@
return child;
#endif
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index d2bbdd2450..31138b6a72 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -1,10 +1,10 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -27,22 +27,22 @@
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
- int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
#ifdef RT_ACTION_UPDATE
Assert(idx >= 0);
- n4->children[idx] = new_child;
+ n3->children[idx] = new_child;
#else
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = n4->values[idx];
+ value = n3->values[idx];
#else
- child = n4->children[idx];
+ child = n3->children[idx];
#endif
#endif /* RT_ACTION_UPDATE */
break;
@@ -125,7 +125,7 @@
return true;
#endif /* RT_ACTION_UPDATE */
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
--
2.39.0
Attachment: v21-0007-Make-value-type-configurable.patch
From 9b6adf0d916cb4bce3dd0e329b59cd1b27013e67 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 14:19:15 +0700
Subject: [PATCH v21 07/22] Make value type configurable
Tests pass with uint32, although the test module builds
with warnings.
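For anyone following along, a minimal sketch of instantiating the template
with a non-default value type, mirroring the test module's defines (RT_PREFIX
and RT_SCOPE are the other template parameters from the full header, not
visible in the hunks below):

#define RT_PREFIX rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
#define RT_VALUE_TYPE uint32	/* any fixed-size value type */
#include "lib/radixtree.h"

/* ... yields rt_create(), rt_set(), rt_search(), rt_delete() taking uint32 values */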
---
src/include/lib/radixtree.h | 79 ++++++++++---------
src/include/lib/radixtree_delete_impl.h | 4 +-
src/include/lib/radixtree_iter_impl.h | 2 +-
src/include/lib/radixtree_search_impl.h | 2 +-
.../modules/test_radixtree/test_radixtree.c | 41 ++++++----
5 files changed, 69 insertions(+), 59 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 98e4597eac..0a39bd6664 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -44,6 +44,7 @@
* declarations reside
* - RT_SHMEM - if defined, the radix tree is created in the DSA area
* so that multiple processes can access it simultaneously.
+ * - RT_VALUE_TYPE - the type of the value.
*
* Optional parameters:
* - RT_DEBUG - if defined add stats tracking and debugging functions
@@ -222,14 +223,14 @@ RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
#endif
RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
-RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
-RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE val);
#ifdef RT_USE_DELETE
RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
#endif
RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
-RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
@@ -435,7 +436,7 @@ typedef struct RT_NODE_LEAF_4
RT_NODE_BASE_4 base;
/* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_LEAF_4;
typedef struct RT_NODE_INNER_32
@@ -451,7 +452,7 @@ typedef struct RT_NODE_LEAF_32
RT_NODE_BASE_32 base;
/* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_LEAF_32;
typedef struct RT_NODE_INNER_125
@@ -467,7 +468,7 @@ typedef struct RT_NODE_LEAF_125
RT_NODE_BASE_125 base;
/* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_LEAF_125;
/*
@@ -490,7 +491,7 @@ typedef struct RT_NODE_LEAF_256
bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
/* Slots for 256 values */
- uint64 values[RT_NODE_MAX_SLOTS];
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
} RT_NODE_LEAF_256;
/* Information for each size class */
@@ -520,33 +521,33 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
.name = "radix tree node 4",
.fanout = 4,
.inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
@@ -648,7 +649,7 @@ typedef struct RT_ITER
static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child);
static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
- uint64 key, uint64 value);
+ uint64 key, RT_VALUE_TYPE value);
/* verification (available only with assertion) */
static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
@@ -828,10 +829,10 @@ RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count,
}
static inline void
-RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
}
/* Delete the element at 'idx' */
@@ -843,10 +844,10 @@ RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count,
}
static inline void
-RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
}
/* Copy both chunks and children/values arrays */
@@ -863,12 +864,12 @@ RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
}
static inline void
-RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
- uint8 *dst_chunks, uint64 *dst_values)
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size values_size = sizeof(uint64) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_values, src_values, values_size);
@@ -890,7 +891,7 @@ RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
return node->children[node->base.slot_idxs[chunk]];
}
-static inline uint64
+static inline RT_VALUE_TYPE
RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
@@ -926,7 +927,7 @@ RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
return node->children[chunk];
}
-static inline uint64
+static inline RT_VALUE_TYPE
RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
@@ -944,7 +945,7 @@ RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
/* Set the value in the node-256 */
static inline void
-RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
{
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
@@ -1215,7 +1216,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
int shift = node->shift;
@@ -1266,7 +1267,7 @@ RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
* to the value is set to value_p.
*/
static inline bool
-RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_search_impl.h"
@@ -1320,7 +1321,7 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stor
/* Like, RT_NODE_INSERT_INNER, but for leaf nodes */
static bool
RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
- uint64 key, uint64 value)
+ uint64 key, RT_VALUE_TYPE value)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_insert_impl.h"
@@ -1522,7 +1523,7 @@ RT_FREE(RT_RADIX_TREE *tree)
* and return true. Returns false if entry doesn't yet exist.
*/
RT_SCOPE bool
-RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
{
int shift;
bool updated;
@@ -1582,7 +1583,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
* not be NULL.
*/
RT_SCOPE bool
-RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
{
RT_PTR_LOCAL node;
int shift;
@@ -1730,7 +1731,7 @@ RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
*/
static inline bool
RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
- uint64 *value_p)
+ RT_VALUE_TYPE *value_p)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_iter_impl.h"
@@ -1803,7 +1804,7 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
* return false.
*/
RT_SCOPE bool
-RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
/* Empty tree */
if (!iter->tree->ctl->root)
@@ -1812,7 +1813,7 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
RT_PTR_LOCAL child = NULL;
- uint64 value;
+ RT_VALUE_TYPE value;
int level;
bool found;
@@ -1971,6 +1972,7 @@ RT_STATS(RT_RADIX_TREE *tree)
tree->ctl->cnt[RT_CLASS_256])));
}
+/* XXX For display, assumes value type is numeric */
static void
RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
@@ -1998,7 +2000,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n4->base.chunks[i], n4->values[i]);
+ space, n4->base.chunks[i], (uint64) n4->values[i]);
}
else
{
@@ -2024,7 +2026,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n32->base.chunks[i], n32->values[i]);
+ space, n32->base.chunks[i], (uint64) n32->values[i]);
}
else
{
@@ -2077,7 +2079,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ space, i, (uint64) RT_NODE_LEAF_125_GET_VALUE(n125, i));
}
else
{
@@ -2107,7 +2109,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
continue;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ space, i, (uint64) RT_NODE_LEAF_256_GET_VALUE(n256, i));
}
else
{
@@ -2213,6 +2215,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_SCOPE
#undef RT_DECLARE
#undef RT_DEFINE
+#undef RT_VALUE_TYPE
/* locally declared macros */
#undef NODE_IS_LEAF
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index eb87866b90..2612730481 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -33,7 +33,7 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, n4->values,
n4->base.n.count, idx);
#else
RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
@@ -50,7 +50,7 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
n32->base.n.count, idx);
#else
RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 0b8b68df6c..5c06f8b414 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -16,7 +16,7 @@
uint8 key_chunk;
#ifdef RT_NODE_LEVEL_LEAF
- uint64 value;
+ RT_VALUE_TYPE value;
Assert(NODE_IS_LEAF(node_iter->node));
#else
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 31e4978e4f..365abaa46d 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -15,7 +15,7 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- uint64 value = 0;
+ RT_VALUE_TYPE value = 0;
Assert(NODE_IS_LEAF(node));
#else
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index d8323f587f..64d46dfe9a 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -24,6 +24,12 @@
#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
/*
* If you enable this, the "pattern" tests will print information about
* how long populating, probing, and iterating the test set takes, and
@@ -105,6 +111,7 @@ static const test_spec test_specs[] = {
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
#include "lib/radixtree.h"
@@ -128,9 +135,9 @@ test_empty(void)
{
rt_radix_tree *radixtree;
rt_iter *iter;
- uint64 dummy;
+ TestValueType dummy;
uint64 key;
- uint64 val;
+ TestValueType val;
#ifdef RT_SHMEM
int tranche_id = LWLockNewTrancheId();
@@ -202,26 +209,26 @@ test_basic(int children, bool test_inner)
/* insert keys */
for (int i = 0; i < children; i++)
{
- if (rt_set(radixtree, keys[i], keys[i]))
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
}
/* look up keys */
for (int i = 0; i < children; i++)
{
- uint64 value;
+ TestValueType value;
if (!rt_search(radixtree, keys[i], &value))
elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
- if (value != keys[i])
+ if (value != (TestValueType) keys[i])
elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
- value, keys[i]);
+ value, (TestValueType) keys[i]);
}
/* update keys */
for (int i = 0; i < children; i++)
{
- if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ if (!rt_set(radixtree, keys[i], (TestValueType) (keys[i] + 1)))
elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
}
@@ -230,7 +237,7 @@ test_basic(int children, bool test_inner)
{
if (!rt_delete(radixtree, keys[i]))
elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
- if (rt_set(radixtree, keys[i], keys[i]))
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
}
@@ -248,12 +255,12 @@ check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
for (int i = start; i < end; i++)
{
uint64 key = ((uint64) i << shift);
- uint64 val;
+ TestValueType val;
if (!rt_search(radixtree, key, &val))
elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
key, end);
- if (val != key)
+ if (val != (TestValueType) key)
elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
key, val, key);
}
@@ -274,7 +281,7 @@ test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
uint64 key = ((uint64) i << shift);
bool found;
- found = rt_set(radixtree, key, key);
+ found = rt_set(radixtree, key, (TestValueType) key);
if (found)
elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
@@ -440,7 +447,7 @@ test_pattern(const test_spec * spec)
x = last_int + pattern_values[i];
- found = rt_set(radixtree, x, x);
+ found = rt_set(radixtree, x, (TestValueType) x);
if (found)
elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
@@ -495,7 +502,7 @@ test_pattern(const test_spec * spec)
bool found;
bool expected;
uint64 x;
- uint64 v;
+ TestValueType v;
/*
* Pick next value to probe at random. We limit the probes to the
@@ -526,7 +533,7 @@ test_pattern(const test_spec * spec)
if (found != expected)
elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
- if (found && (v != x))
+ if (found && (v != (TestValueType) x))
elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
v, x);
}
@@ -549,7 +556,7 @@ test_pattern(const test_spec * spec)
{
uint64 expected = last_int + pattern_values[i];
uint64 x;
- uint64 val;
+ TestValueType val;
if (!rt_iterate_next(iter, &x, &val))
break;
@@ -558,7 +565,7 @@ test_pattern(const test_spec * spec)
elog(ERROR,
"iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
x, expected, i);
- if (val != expected)
+ if (val != (TestValueType) expected)
elog(ERROR,
"iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
n++;
@@ -588,7 +595,7 @@ test_pattern(const test_spec * spec)
{
bool found;
uint64 x;
- uint64 v;
+ TestValueType v;
/*
* Pick next value to probe at random. We limit the probes to the
--
2.39.0
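As a reading aid for the attachments that follow: the template is instantiated by defining a handful of RT_* macros and then including radixtree.h, exactly as test_radixtree.c above and bench_radix_tree.c below do. The following is a minimal sketch, not part of the patch set, assuming a local (non-RT_SHMEM) tree and an integer value type; with RT_PREFIX set to "myrt" the generated names become myrt_create(), myrt_set(), and so on.

#include "postgres.h"

typedef uint32 MyValueType;	/* any integer type is fine at this point in the series */

#define RT_PREFIX myrt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
#define RT_VALUE_TYPE MyValueType
#include "lib/radixtree.h"

static void
radixtree_sketch(void)
{
	myrt_radix_tree *tree = myrt_create(CurrentMemoryContext);
	MyValueType val;

	/* myrt_set() returns true if the key was already present */
	if (!myrt_set(tree, UINT64CONST(42), (MyValueType) 100))
		elog(NOTICE, "inserted new key");

	if (myrt_search(tree, UINT64CONST(42), &val))
		elog(NOTICE, "found value %u", val);

	myrt_delete(tree, UINT64CONST(42));
	myrt_free(tree);
}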
Attachment: v21-0012-Tool-for-measuring-radix-tree-performance.patch
From 96efc422dd1858951de4e41563c081b1d2faaf5f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v21 12/22] Tool for measuring radix tree performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 ++
contrib/bench_radix_tree/bench_radix_tree.c | 656 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 822 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..4c785c7336
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,656 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.0
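Before moving on to the next patch, a worked example of the key encoding in tid_to_key_off() above may help, assuming the default 8kB block size so that MaxHeapTuplesPerPage is 291 and pg_ceil_log2_32(291) is 9 (these numbers are not spelled out in the patch; they follow from the defaults):

    block = 10, offset = 7
    tid_i = offset | (block << 9) = 7 | 5120 = 5127
    off   = tid_i & 63            = 7        (bit position within the uint64 value)
    key   = tid_i >> 6            = 80       (radix tree key)
    value stored under key 80    |= UINT64CONST(1) << 7

So one radix tree entry covers 64 consecutive (block, offset) slots, and each heap block, rounded up to 512 slots, spans 8 consecutive keys.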
Attachment: v21-0013-Get-rid-of-NODE_IS_EMPTY-macro.patch
From a3829e483ac68d31efb19cb2eca128e200d92c1f Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 21 Jan 2023 13:40:28 +0700
Subject: [PATCH v21 13/22] Get rid of NODE_IS_EMPTY macro
It's already pretty clear what "count == 0" means, and the
existing comments make it obvious.
---
src/include/lib/radixtree.h | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 4a2dad82bf..567eab4bc8 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -372,7 +372,6 @@ typedef struct RT_NODE
#endif
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
@@ -1701,7 +1700,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
* Return if the leaf node still has keys and we don't need to delete the
* node.
*/
- if (!NODE_IS_EMPTY(node))
+ if (node->count > 0)
return true;
/* Free the empty leaf node */
@@ -1717,7 +1716,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
Assert(deleted);
/* If the node didn't become empty, we stop deleting the key */
- if (!NODE_IS_EMPTY(node))
+ if (node->count > 0)
break;
/* The node became empty */
@@ -2239,7 +2238,6 @@ RT_DUMP(RT_RADIX_TREE *tree)
/* locally declared macros */
#undef NODE_IS_LEAF
-#undef NODE_IS_EMPTY
#undef VAR_NODE_HAS_FREE_SLOT
#undef FIXED_NODE_HAS_FREE_SLOT
#undef RT_NODE_KIND_COUNT
--
2.39.0
Attachment: v21-0015-Get-rid-of-FIXED_NODE_HAS_FREE_SLOT.patch
From f1288439306d54677d626cfc0c1cbdc41fcf0ca5 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 11:53:33 +0700
Subject: [PATCH v21 15/22] Get rid of FIXED_NODE_HAS_FREE_SLOT
It's only used in one assert for the node256 kind, whose
fanout is necessarily fixed, and we already have a
convenient macro to compare that with.
---
src/include/lib/radixtree.h | 3 ---
src/include/lib/radixtree_insert_impl.h | 2 +-
2 files changed, 1 insertion(+), 4 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d48c915373..8fbc0b5086 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -374,8 +374,6 @@ typedef struct RT_NODE
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
-#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
- ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
/* Base type of each node kinds for leaf and inner nodes */
/* The base types must be a be able to accommodate the largest size
@@ -2262,7 +2260,6 @@ RT_DUMP(RT_RADIX_TREE *tree)
/* locally declared macros */
#undef NODE_IS_LEAF
#undef VAR_NODE_HAS_FREE_SLOT
-#undef FIXED_NODE_HAS_FREE_SLOT
#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
#undef RT_RADIX_TREE_MAGIC
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 8470c8fc70..b484b7a099 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -286,7 +286,7 @@
#else
chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
#endif
- Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
#ifdef RT_NODE_LEVEL_LEAF
RT_NODE_LEAF_256_SET(n256, chunk, value);
--
2.39.0
Attachment: v21-0014-Add-some-comments-for-insert-logic.patch
From 503ccef8841efcb0809acc73f6f4cc2428342080 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 21 Jan 2023 14:21:55 +0700
Subject: [PATCH v21 14/22] Add some comments for insert logic
---
src/include/lib/radixtree.h | 29 ++++++++++++++++++++++---
src/include/lib/radixtree_insert_impl.h | 5 +++++
2 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 567eab4bc8..d48c915373 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -731,8 +731,8 @@ RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
}
/*
- * Return index of the first element in 'base' that equals 'key'. Return -1
- * if there is no such element.
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
*/
static inline int
RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
@@ -762,14 +762,22 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
#endif
#ifndef USE_NO_SIMD
+ /* replicate the search key */
spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to the 32 keys stored in the node */
vector8_load(&haystack1, &node->chunks[0]);
vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
cmp1 = vector8_eq(spread_chunk, haystack1);
cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
bitfield &= ((UINT64CONST(1) << count) - 1);
+ /* convert bitfield to index by counting trailing zeros */
if (bitfield)
index_simd = pg_rightmost_one_pos32(bitfield);
@@ -781,7 +789,8 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
}
/*
- * Return index of the chunk to insert into chunks in the given node.
+ * Return index of the node's chunk array to insert into,
+ * such that the chunk array remains ordered.
*/
static inline int
RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
@@ -804,12 +813,26 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
for (index = 0; index < count; index++)
{
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
break;
+ }
}
#endif
#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * <=. There'll never be any equal elements in the current uses, but that's
+ * what we get here...
+ */
spread_chunk = vector8_broadcast(chunk);
vector8_load(&haystack1, &node->chunks[0]);
vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 16461bdb03..8470c8fc70 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -162,6 +162,11 @@
#endif
}
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
--
2.39.0
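To restate the comments this patch adds in scalar terms: RT_NODE_32_SEARCH_EQ() builds a bitfield with one bit per occupied slot, set where that slot's chunk equals the search key, and then returns the position of the lowest set bit. A plain-C sketch of the same computation, for illustration only and not part of the patch set (pg_rightmost_one_pos32() is the same helper the SIMD path already uses):

#include "postgres.h"
#include "port/pg_bitutils.h"

/*
 * Illustration: scalar equivalent of the SIMD path in RT_NODE_32_SEARCH_EQ().
 * Returns the index of the first slot whose chunk equals 'chunk', or -1.
 */
static int
node32_search_eq_scalar(const uint8 *chunks, int count, uint8 chunk)
{
	uint32		bitfield = 0;

	for (int i = 0; i < count; i++)
	{
		if (chunks[i] == chunk)
			bitfield |= (uint32) 1 << i;
	}

	/* only the first 'count' bits were ever set, so no extra masking needed */
	if (bitfield == 0)
		return -1;

	/* index = position of the rightmost (lowest) set bit */
	return pg_rightmost_one_pos32(bitfield);
}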
Attachment: v21-0011-Expand-commentary-for-kinds-vs.-size-classes.patch
From 8711b9afb019ba45a5c4c3e2ec41f72130208a68 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 21 Jan 2023 12:52:53 +0700
Subject: [PATCH v21 11/22] Expand commentary for kinds vs. size classes
Also move class enum closer to array and add #undef's
---
src/include/lib/radixtree.h | 76 ++++++++++++++++++++++++++-----------
1 file changed, 53 insertions(+), 23 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 6cc8442c89..4a2dad82bf 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -288,22 +288,26 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
/*
- * Supported radix tree node kinds and size classes.
+ * Node kinds
*
- * There are 4 node kinds and each node kind have one or two size classes,
- * partial and full. The size classes in the same node kind have the same
- * node structure but have the different number of fanout that is stored
- * in 'fanout' of RT_NODE. For example in size class 15, when a 16th element
- * is to be inserted, we allocate a larger area and memcpy the entire old
- * node to it.
+ * The different node kinds are what make the tree "adaptive".
*
- * This technique allows us to limit the node kinds to 4, which limits the
- * number of cases in switch statements. It also allows a possible future
- * optimization to encode the node kind in a pointer tag.
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256 is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
*
- * These size classes have been chose carefully so that it minimizes the
- * allocator padding in both the inner and leaf nodes on DSA.
- * node
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
*/
#define RT_NODE_KIND_3 0x00
#define RT_NODE_KIND_32 0x01
@@ -320,16 +324,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#endif /* RT_COMMON */
-
-typedef enum RT_SIZE_CLASS
-{
- RT_CLASS_3_FULL = 0,
- RT_CLASS_32_PARTIAL,
- RT_CLASS_32_FULL,
- RT_CLASS_125_FULL,
- RT_CLASS_256
-} RT_SIZE_CLASS;
-
/* Common type for all nodes types */
typedef struct RT_NODE
{
@@ -508,6 +502,37 @@ typedef struct RT_NODE_LEAF_256
RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
} RT_NODE_LEAF_256;
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
/* Information for each size class */
typedef struct RT_SIZE_CLASS_ELEM
{
@@ -2217,6 +2242,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef NODE_IS_EMPTY
#undef VAR_NODE_HAS_FREE_SLOT
#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
#undef RT_RADIX_TREE_MAGIC
@@ -2229,6 +2255,10 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ITER
#undef RT_NODE
#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
#undef RT_NODE_BASE_3
#undef RT_NODE_BASE_32
#undef RT_NODE_BASE_125
--
2.39.0
Attachment: v21-0019-Standardize-on-testing-for-is-leaf.patch
From 42bdeca3facdcaf43284ba0a6c85b6db0ac63ead Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 15:10:10 +0700
Subject: [PATCH v21 19/22] Standardize on testing for "is leaf"
Some recent code decided to test for "is inner", so make
everything consistent.
---
src/include/lib/radixtree.h | 38 ++++++++++++-------------
src/include/lib/radixtree_insert_impl.h | 18 ++++++------
2 files changed, 28 insertions(+), 28 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 95124696ef..5927437034 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1019,24 +1019,24 @@ RT_SHIFT_GET_MAX_VAL(int shift)
* Allocate a new node with the given node kind.
*/
static RT_PTR_ALLOC
-RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
{
RT_PTR_ALLOC allocnode;
size_t allocsize;
- if (inner)
- allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
- else
+ if (is_leaf)
allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
#ifdef RT_SHMEM
allocnode = dsa_allocate(tree->dsa, allocsize);
#else
- if (inner)
- allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
allocsize);
else
- allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
allocsize);
#endif
@@ -1050,12 +1050,12 @@ RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
/* Initialize the node contents */
static inline void
-RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
{
- if (inner)
- MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
- else
+ if (is_leaf)
MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
node->kind = kind;
@@ -1082,13 +1082,13 @@ static void
RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
{
int shift = RT_KEY_GET_SHIFT(key);
- bool inner = shift > 0;
+ bool is_leaf = shift == 0;
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1107,10 +1107,10 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
*/
static inline RT_PTR_LOCAL
RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
- uint8 new_kind, uint8 new_class, bool inner)
+ uint8 new_kind, uint8 new_class, bool is_leaf)
{
RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
RT_COPY_NODE(newnode, node);
return newnode;
@@ -1247,11 +1247,11 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL
RT_PTR_ALLOC allocchild;
RT_PTR_LOCAL newchild;
int newshift = shift - RT_NODE_SPAN;
- bool inner = newshift > 0;
+ bool is_leaf = newshift == 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, inner);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 0fcebf1c6b..22aca0e6cc 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -16,10 +16,10 @@
bool chunk_exists = false;
#ifdef RT_NODE_LEVEL_LEAF
- const bool inner = false;
+ const bool is_leaf = true;
Assert(RT_NODE_IS_LEAF(node));
#else
- const bool inner = true;
+ const bool is_leaf = false;
Assert(!RT_NODE_IS_LEAF(node));
#endif
@@ -52,8 +52,8 @@
const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
/* grow node from 3 to 32 */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
new32 = (RT_NODE32_TYPE *) newnode;
#ifdef RT_NODE_LEVEL_LEAF
@@ -124,7 +124,7 @@
Assert(n32->base.n.fanout == class32_min.fanout);
/* grow to the next size class of this kind */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
n32 = (RT_NODE32_TYPE *) newnode;
@@ -150,8 +150,8 @@
Assert(n32->base.n.fanout == class32_max.fanout);
/* grow node from 32 to 125 */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
new125 = (RT_NODE125_TYPE *) newnode;
for (int i = 0; i < class32_max.fanout; i++)
@@ -229,8 +229,8 @@
const RT_SIZE_CLASS new_class = RT_CLASS_256;
/* grow node from 125 to 256 */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
new256 = (RT_NODE256_TYPE *) newnode;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
--
2.39.0
Attachment: v21-0016-s-VAR_NODE_HAS_FREE_SLOT-RT_NODE_MUST_GROW.patch
From 55c4517ebaa67e94b2e52e1a1164d44bf09e0bb4 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 12:11:11 +0700
Subject: [PATCH v21 16/22] s/VAR_NODE_HAS_FREE_SLOT/RT_NODE_MUST_GROW/
---
src/include/lib/radixtree.h | 6 +++---
src/include/lib/radixtree_insert_impl.h | 8 ++++----
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 8fbc0b5086..cd8b8d1c22 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -372,8 +372,8 @@ typedef struct RT_NODE
#endif
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
-#define VAR_NODE_HAS_FREE_SLOT(node) \
- ((node)->base.n.count < (node)->base.n.fanout)
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
/* Base type of each node kinds for leaf and inner nodes */
/* The base types must be a be able to accommodate the largest size
@@ -2259,7 +2259,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
/* locally declared macros */
#undef NODE_IS_LEAF
-#undef VAR_NODE_HAS_FREE_SLOT
+#undef RT_NODE_MUST_GROW
#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
#undef RT_RADIX_TREE_MAGIC
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index b484b7a099..a0f46b37d3 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -43,7 +43,7 @@
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n3)))
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
@@ -114,7 +114,7 @@
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
n32->base.n.fanout == class32_min.fanout)
{
RT_PTR_ALLOC allocnode;
@@ -137,7 +137,7 @@
node = newnode;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
@@ -218,7 +218,7 @@
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
--
2.39.0
Attachment: v21-0017-Remove-some-maintenance-hazards-in-growing-nodes.patch
From 5c577598bc1f667333c97dca70155df6d296c251 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 13:29:18 +0700
Subject: [PATCH v21 17/22] Remove some maintenance hazards in growing nodes
Arrange so that kinds with only one size class have no
"full" suffix. This ensures that splitting such a class
into multiple classes will force compilation errors if
the dev has not thought through which new class should
apply in each case.
For node32, make growing into a new size class a bit
more general. It's not clear we would ever need more
than 2 classes, but let's not put up additional road
blocks. Change partial/full to min/max. It's a bit
shorter this way, matches some newer coding, and allows
for the possibility of a "mid" class.
Also remove RT_KIND_MIN_SIZE_CLASS, since it doesn't
reduce the need for future changes; it only moves such
a change further away from its effect.
In passing, move a declaration to the block where it's used.
---
src/include/lib/radixtree.h | 66 +++++++++++--------------
src/include/lib/radixtree_insert_impl.h | 16 +++---
2 files changed, 37 insertions(+), 45 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index cd8b8d1c22..7c3f3dcf4f 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -196,12 +196,11 @@
#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
-#define RT_CLASS_3_FULL RT_MAKE_NAME(class_3_full)
-#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
-#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
-#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
#define RT_CLASS_256 RT_MAKE_NAME(class_256)
-#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
/* generate forward declarations necessary to use the radix tree */
#ifdef RT_DECLARE
@@ -523,10 +522,10 @@ typedef struct RT_NODE_LEAF_256
*/
typedef enum RT_SIZE_CLASS
{
- RT_CLASS_3_FULL = 0,
- RT_CLASS_32_PARTIAL,
- RT_CLASS_32_FULL,
- RT_CLASS_125_FULL,
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
RT_CLASS_256
} RT_SIZE_CLASS;
@@ -542,25 +541,25 @@ typedef struct RT_SIZE_CLASS_ELEM
} RT_SIZE_CLASS_ELEM;
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
- [RT_CLASS_3_FULL] = {
+ [RT_CLASS_3] = {
.name = "radix tree node 3",
.fanout = 3,
.inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_32_PARTIAL] = {
+ [RT_CLASS_32_MIN] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_32_FULL] = {
+ [RT_CLASS_32_MAX] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_125_FULL] = {
+ [RT_CLASS_125] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
@@ -576,14 +575,6 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
-/* Map from the node kind to its minimum size class */
-static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
- [RT_NODE_KIND_3] = RT_CLASS_3_FULL,
- [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
- [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
- [RT_NODE_KIND_256] = RT_CLASS_256,
-};
-
#ifdef RT_SHMEM
/* A magic value used to identify our radix tree */
#define RT_RADIX_TREE_MAGIC 0x54A48167
@@ -893,7 +884,7 @@ static inline void
RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
@@ -905,7 +896,7 @@ static inline void
RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
@@ -1105,9 +1096,9 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, inner);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1230,9 +1221,9 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_LOCAL node;
RT_NODE_INNER_3 *n3;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, true);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, true);
node = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3_FULL, true);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, true);
node->shift = shift;
node->count = 1;
@@ -1268,9 +1259,9 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, inner);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
@@ -2007,10 +1998,10 @@ RT_STATS(RT_RADIX_TREE *tree)
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
tree->ctl->num_keys,
tree->ctl->root->shift / RT_NODE_SPAN,
- tree->ctl->cnt[RT_CLASS_3_FULL],
- tree->ctl->cnt[RT_CLASS_32_PARTIAL],
- tree->ctl->cnt[RT_CLASS_32_FULL],
- tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
tree->ctl->cnt[RT_CLASS_256])));
}
@@ -2292,12 +2283,11 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_SIZE_CLASS
#undef RT_SIZE_CLASS_ELEM
#undef RT_SIZE_CLASS_INFO
-#undef RT_CLASS_3_FULL
-#undef RT_CLASS_32_PARTIAL
-#undef RT_CLASS_32_FULL
-#undef RT_CLASS_125_FULL
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
#undef RT_CLASS_256
-#undef RT_KIND_MIN_SIZE_CLASS
/* function declarations */
#undef RT_CREATE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index a0f46b37d3..e3c3f7a69d 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -49,7 +49,7 @@
RT_PTR_LOCAL newnode;
RT_NODE32_TYPE *new32;
const uint8 new_kind = RT_NODE_KIND_32;
- const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
/* grow node from 3 to 32 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
@@ -96,8 +96,7 @@
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
- const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
int idx;
@@ -115,11 +114,14 @@
}
if (unlikely(RT_NODE_MUST_GROW(n32)) &&
- n32->base.n.fanout == class32_min.fanout)
+ n32->base.n.fanout < class32_max.fanout)
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
/* grow to the next size class of this kind */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
@@ -143,7 +145,7 @@
RT_PTR_LOCAL newnode;
RT_NODE125_TYPE *new125;
const uint8 new_kind = RT_NODE_KIND_125;
- const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
Assert(n32->base.n.fanout == class32_max.fanout);
@@ -224,7 +226,7 @@
RT_PTR_LOCAL newnode;
RT_NODE256_TYPE *new256;
const uint8 new_kind = RT_NODE_KIND_256;
- const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
/* grow node from 125 to 256 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
--
2.39.0
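To summarize what the hunks above do: a node kind can now have more than one size class, and the enum names describe the class itself (RT_CLASS_3, RT_CLASS_32_MIN, RT_CLASS_32_MAX, RT_CLASS_125, RT_CLASS_256) rather than whether it is "partial" or "full". The growth decision for a full kind-32 node then looks roughly like this (a sketch only; grow_within_kind() and grow_to_next_kind() are not real functions, just shorthand for the copy-and-replace logic in radixtree_insert_impl.h):

    if (RT_NODE_MUST_GROW(n32))
    {
        if (n32->base.n.fanout < RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].fanout)
            grow_within_kind(tree, n32, RT_CLASS_32_MAX);   /* 32-min -> 32-max, same kind */
        else
            grow_to_next_kind(tree, n32, RT_CLASS_125);     /* 32-max is full -> node-125 */
    }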
Attachment: v21-0018-Clean-up-symbols.patch (text/x-patch)
From 96e730fd7056ca0e13b36489ce9e6717fef37318 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 14:37:53 +0700
Subject: [PATCH v21 18/22] Clean up symbols
Remove remaining stragglers that weren't named "RT_*"
and get rid of the temporary expedient RT_COMMON
block in favor of explicit #undefs everywhere.
---
src/include/lib/radixtree.h | 91 ++++++++++++++-----------
src/include/lib/radixtree_delete_impl.h | 4 +-
src/include/lib/radixtree_insert_impl.h | 4 +-
src/include/lib/radixtree_iter_impl.h | 4 +-
src/include/lib/radixtree_search_impl.h | 4 +-
5 files changed, 58 insertions(+), 49 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 7c3f3dcf4f..95124696ef 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -246,14 +246,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
/* generate implementation of the radix tree */
#ifdef RT_DEFINE
-/* macros and types common to all implementations */
-#ifndef RT_COMMON
-#define RT_COMMON
-
-#ifdef RT_DEBUG
-#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
-#endif
-
/* The number of bits encoded in one tree level */
#define RT_NODE_SPAN BITS_PER_BYTE
@@ -321,8 +313,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#define RT_SLAB_BLOCK_SIZE(size) \
Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
-#endif /* RT_COMMON */
-
/* Common type for all nodes types */
typedef struct RT_NODE
{
@@ -370,7 +360,7 @@ typedef struct RT_NODE
#define RT_INVALID_PTR_ALLOC NULL
#endif
-#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
#define RT_NODE_MUST_GROW(node) \
((node)->base.n.count == (node)->base.n.fanout)
@@ -916,14 +906,14 @@ RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
static inline RT_PTR_ALLOC
RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline RT_VALUE_TYPE
RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -934,7 +924,7 @@ RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
static inline bool
RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[chunk] != RT_INVALID_PTR_ALLOC;
}
@@ -944,14 +934,14 @@ RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
}
static inline RT_PTR_ALLOC
RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
return node->children[chunk];
}
@@ -959,7 +949,7 @@ RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
static inline RT_VALUE_TYPE
RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
return node->values[chunk];
}
@@ -968,7 +958,7 @@ RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
static inline void
RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -979,7 +969,7 @@ RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[idx] |= ((bitmapword) 1 << bitnum);
node->values[chunk] = value;
}
@@ -988,7 +978,7 @@ RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
static inline void
RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = RT_INVALID_PTR_ALLOC;
}
@@ -998,7 +988,7 @@ RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[idx] &= ~((bitmapword) 1 << bitnum);
}
@@ -1458,7 +1448,7 @@ RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
CHECK_FOR_INTERRUPTS();
/* The leaf node doesn't have child pointers */
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
dsa_free(tree->dsa, ptr);
return;
@@ -1587,7 +1577,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
child = RT_PTR_GET_LOCAL(tree, stored_child);
- if (NODE_IS_LEAF(child))
+ if (RT_NODE_IS_LEAF(child))
break;
if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
@@ -1637,7 +1627,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
{
RT_PTR_ALLOC child;
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
break;
if (!RT_NODE_SEARCH_INNER(node, key, &child))
@@ -1788,7 +1778,7 @@ RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
node_iter->current_idx = -1;
/* We don't advance the leaf node iterator here */
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
return;
/* Advance to the next slot in the inner node */
@@ -1972,7 +1962,7 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
}
case RT_NODE_KIND_256:
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
int cnt = 0;
@@ -1992,6 +1982,9 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
/***************** DEBUG FUNCTIONS *****************/
#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
RT_SCOPE void
RT_STATS(RT_RADIX_TREE *tree)
{
@@ -2012,7 +2005,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
char space[125] = {0};
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
- NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
(node->kind == RT_NODE_KIND_3) ? 3 :
(node->kind == RT_NODE_KIND_32) ? 32 :
(node->kind == RT_NODE_KIND_125) ? 125 : 256,
@@ -2028,11 +2021,11 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
for (int i = 0; i < node->count; i++)
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, n3->base.chunks[i], (uint64) n3->values[i]);
}
else
@@ -2054,11 +2047,11 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
for (int i = 0; i < node->count; i++)
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, n32->base.chunks[i], (uint64) n32->values[i]);
}
else
@@ -2090,14 +2083,14 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
}
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < BM_IDX(RT_SLOT_IDX_LIMIT); i++)
{
- fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ fprintf(stderr, RT_UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
}
fprintf(stderr, "\n");
}
@@ -2107,11 +2100,11 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
continue;
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, i, (uint64) RT_NODE_LEAF_125_GET_VALUE(n125, i));
}
else
@@ -2134,14 +2127,14 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
continue;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, i, (uint64) RT_NODE_LEAF_256_GET_VALUE(n256, i));
}
else
@@ -2174,7 +2167,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
int level = 0;
elog(NOTICE, "-----------------------------------------------------------");
- elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ")",
tree->ctl->max_val, tree->ctl->max_val);
if (!tree->ctl->root)
@@ -2185,7 +2178,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
if (key > tree->ctl->max_val)
{
- elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
@@ -2198,7 +2191,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
RT_DUMP_NODE(node, level, false);
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
uint64 dummy;
@@ -2249,15 +2242,30 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_VALUE_TYPE
/* locally declared macros */
-#undef NODE_IS_LEAF
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef BM_IDX
+#undef BM_BIT
+#undef RT_NODE_IS_LEAF
#undef RT_NODE_MUST_GROW
#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
/* type declarations */
#undef RT_RADIX_TREE
#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
#undef RT_PTR_ALLOC
#undef RT_INVALID_PTR_ALLOC
#undef RT_HANDLE
@@ -2295,6 +2303,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ATTACH
#undef RT_DETACH
#undef RT_GET_HANDLE
+#undef RT_SEARCH
#undef RT_SET
#undef RT_BEGIN_ITERATE
#undef RT_ITERATE_NEXT
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index b9f07f4eb5..99c90771b9 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -17,9 +17,9 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
#else
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
#endif
switch (node->kind)
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index e3c3f7a69d..0fcebf1c6b 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -17,10 +17,10 @@
#ifdef RT_NODE_LEVEL_LEAF
const bool inner = false;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
#else
const bool inner = true;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
#endif
switch (node->kind)
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index c428531438..823d7107c4 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -18,11 +18,11 @@
#ifdef RT_NODE_LEVEL_LEAF
RT_VALUE_TYPE value;
- Assert(NODE_IS_LEAF(node_iter->node));
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
#else
RT_PTR_LOCAL child = NULL;
- Assert(!NODE_IS_LEAF(node_iter->node));
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
#endif
#ifdef RT_SHMEM
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 31138b6a72..c4352045c8 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -17,12 +17,12 @@
#ifdef RT_NODE_LEVEL_LEAF
RT_VALUE_TYPE value = 0;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
#else
#ifndef RT_ACTION_UPDATE
RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
#endif
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
#endif
switch (node->kind)
--
2.39.0
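The reason every generated or helper symbol has to be #undef'ed at the end of radixtree.h is that the template can then be included more than once in the same translation unit with different prefixes. A minimal sketch of the pattern (the TIDStore patch further down does exactly this, with local_rt and shared_rt prefixes):

    #define RT_PREFIX local_rt
    #define RT_SCOPE static
    #define RT_DECLARE
    #define RT_DEFINE
    #include "lib/radixtree.h"

    #define RT_PREFIX shared_rt
    #define RT_SHMEM
    #define RT_SCOPE static
    #define RT_DECLARE
    #define RT_DEFINE
    #include "lib/radixtree.h"

Any straggler still defined after the first inclusion would leak into, or collide with, the second instantiation, which is why the remaining non-RT_* names are renamed and explicitly #undef'ed here.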
Attachment: v21-0020-Do-some-rewriting-and-proofreading-of-comments.patch (text/x-patch)
From bf3219324a0b336166390dacfe2ab91ba96d6417 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 23 Jan 2023 18:00:20 +0700
Subject: [PATCH v21 20/22] Do some rewriting and proofreading of comments
In passing, change one ternary operator to if/else.
---
src/include/lib/radixtree.h | 160 +++++++++++++++++++++---------------
1 file changed, 92 insertions(+), 68 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 5927437034..7fcd212ea4 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -9,25 +9,38 @@
* types, each with a different numbers of elements. Depending on the number of
* children, the appropriate node type is used.
*
- * There are some differences from the proposed implementation. For instance,
- * there is not support for path compression and lazy path expansion. The radix
- * tree supports fixed length of the key so we don't expect the tree level
- * wouldn't be high.
+ * WIP: notes about traditional radix tree trading off span vs height...
*
- * Both the key and the value are 64-bit unsigned integer. The inner nodes and
- * the leaf nodes have slightly different structure: for inner tree nodes,
- * shift > 0, store the pointer to its child node as the value. The leaf nodes,
- * shift == 0, have the 64-bit unsigned integer that is specified by the user as
- * the value. The paper refers to this technique as "Multi-value leaves". We
- * choose it to avoid an additional pointer traversal. It is the reason this code
- * currently does not support variable-length keys.
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
*
- * XXX: Most functions in this file have two variants for inner nodes and leaf
- * nodes, therefore there are duplication codes. While this sometimes makes the
- * code maintenance tricky, this reduces branch prediction misses when judging
- * whether the node is a inner node of a leaf node.
+ * The ART paper mentions three ways to implement leaves:
*
- * XXX: the radix tree node never be shrunk.
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves"
+ *
+ * For simplicity, the key is assumed to be a 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * WIP: the radix tree nodes don't shrink.
*
* To generate a radix tree and associated functions for a use case several
* macros have to be #define'ed before this file is included. Including
@@ -42,11 +55,11 @@
* - RT_DEFINE - if defined function definitions are generated
* - RT_SCOPE - in which scope (e.g. extern, static inline) do function
* declarations reside
- * - RT_SHMEM - if defined, the radix tree is created in the DSA area
- * so that multiple processes can access it simultaneously.
* - RT_VALUE_TYPE - the type of the value.
*
* Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
* - RT_DEBUG - if defined add stats tracking and debugging functions
*
* Interface
@@ -54,9 +67,6 @@
*
* RT_CREATE - Create a new, empty radix tree
* RT_FREE - Free the radix tree
- * RT_ATTACH - Attach to the radix tree
- * RT_DETACH - Detach from the radix tree
- * RT_GET_HANDLE - Return the handle of the radix tree
* RT_SEARCH - Search a key-value pair
* RT_SET - Set a key-value pair
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
@@ -64,11 +74,12 @@
* RT_END_ITER - End iteration
* RT_MEMORY_USAGE - Get the memory usage
*
- * RT_CREATE() creates an empty radix tree in the given memory context
- * and memory contexts for all kinds of radix tree node under the memory context.
+ * Interface for Shared Memory
+ * ---------
*
- * RT_ITERATE_NEXT() ensures returning key-value pairs in the ascending
- * order of the key.
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
*
* Optional Interface
* ---------
@@ -360,13 +371,23 @@ typedef struct RT_NODE
#define RT_INVALID_PTR_ALLOC NULL
#endif
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: Inner tree nodes (shift > 0), store the
+ * pointer to its child node in the slot. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
#define RT_NODE_MUST_GROW(node) \
((node)->base.n.count == (node)->base.n.fanout)
-/* Base type of each node kinds for leaf and inner nodes */
-/* The base types must be a be able to accommodate the largest size
-class for variable-sized node kinds*/
+/*
+ * Base type of each node kind, for both leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
typedef struct RT_NODE_BASE_3
{
RT_NODE n;
@@ -384,9 +405,9 @@ typedef struct RT_NODE_BASE_32
} RT_NODE_BASE_32;
/*
- * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
- * 256, to store indexes into a second array that contains up to 125 values (or
- * child pointers in inner nodes).
+ * node-125 uses the slot_idx array, an array of RT_NODE_MAX_SLOTS length,
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
*/
typedef struct RT_NODE_BASE_125
{
@@ -407,15 +428,8 @@ typedef struct RT_NODE_BASE_256
/*
* Inner and leaf nodes.
*
- * Theres are separate for two main reasons:
- *
- * 1) the value type might be different than something fitting into a pointer
- * width type
- * 2) Need to represent non-existing values in a key-type independent way.
- *
- * 1) is clearly worth being concerned about, but it's not clear 2) is as
- * good. It might be better to just indicate non-existing entries the same way
- * in inner nodes.
+ * These are separate because the value type might be different from
+ * something fitting into a pointer-width type.
*/
typedef struct RT_NODE_INNER_3
{
@@ -466,8 +480,10 @@ typedef struct RT_NODE_LEAF_125
} RT_NODE_LEAF_125;
/*
- * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * node-256 is the largest node type. This node has an array
* for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is fixed by
+ * definition.
*/
typedef struct RT_NODE_INNER_256
{
@@ -481,7 +497,10 @@ typedef struct RT_NODE_LEAF_256
{
RT_NODE_BASE_256 base;
- /* isset is a bitmap to track which slot is in use */
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slot is in use.
+ */
bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
/* Slots for 256 values */
@@ -570,7 +589,8 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
#define RT_RADIX_TREE_MAGIC 0x54A48167
#endif
-/* A radix tree with nodes */
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
typedef struct RT_RADIX_TREE_CONTROL
{
#ifdef RT_SHMEM
@@ -588,7 +608,7 @@ typedef struct RT_RADIX_TREE_CONTROL
#endif
} RT_RADIX_TREE_CONTROL;
-/* A radix tree with nodes */
+/* Entry point for allocating and accessing the tree */
typedef struct RT_RADIX_TREE
{
MemoryContext context;
@@ -613,15 +633,15 @@ typedef struct RT_RADIX_TREE
* RT_NODE_ITER struct is used to track the iteration within a node.
*
* RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
- * in order to track the iteration of each level. During the iteration, we also
+ * in order to track the iteration of each level. During iteration, we also
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
-+ *
-+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
-+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
-+ * We need either a safeguard to disallow other processes to begin the iteration
-+ * while one process is doing or to allow multiple processes to do the iteration.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes to begin the iteration
+ * while one process is doing or to allow multiple processes to do the iteration.
*/
typedef struct RT_NODE_ITER
{
@@ -637,7 +657,7 @@ typedef struct RT_ITER
RT_NODE_ITER stack[RT_MAX_LEVEL];
int stack_len;
- /* The key is being constructed during the iteration */
+ /* The key is constructed during iteration */
uint64 key;
} RT_ITER;
@@ -672,8 +692,8 @@ RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
}
/*
- * Return index of the first element in 'base' that equals 'key'. Return -1
- * if there is no such element.
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
*/
static inline int
RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
@@ -693,7 +713,8 @@ RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
}
/*
- * Return index of the chunk to insert into chunks in the given node.
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
*/
static inline int
RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
@@ -744,7 +765,7 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
/* replicate the search key */
spread_chunk = vector8_broadcast(chunk);
- /* compare to the 32 keys stored in the node */
+ /* compare to all 32 keys stored in the node */
vector8_load(&haystack1, &node->chunks[0]);
vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
cmp1 = vector8_eq(spread_chunk, haystack1);
@@ -768,7 +789,7 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
}
/*
- * Return index of the node's chunk array to insert into,
+ * Return index of the chunk and slot arrays for inserting into the node,
* such that the chunk array remains ordered.
*/
static inline int
@@ -809,7 +830,7 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
* This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
* no unsigned uint8 comparison instruction exists, at least for SSE2. So
* we need to play some trickery using vector8_min() to effectively get
- * <=. There'll never be any equal elements in the current uses, but that's
+ * <=. There'll never be any equal elements in current uses, but that's
* what we get here...
*/
spread_chunk = vector8_broadcast(chunk);
@@ -834,6 +855,7 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
#endif
}
+
/*
* Functions to manipulate both chunks array and children/values array.
* These are used for node-3 and node-32.
@@ -993,18 +1015,19 @@ RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
}
/*
- * Return the shift that is satisfied to store the given key.
+ * Return the largest shift that will allow storing the given key.
*/
static inline int
RT_KEY_GET_SHIFT(uint64 key)
{
- return (key == 0)
- ? 0
- : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
}
/*
- * Return the max value stored in a node with the given shift.
+ * Return the max value that can be stored in the tree with the given shift.
*/
static uint64
RT_SHIFT_GET_MAX_VAL(int shift)
@@ -1155,6 +1178,7 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
#endif
}
+/* Update the parent's pointer when growing a node */
static inline void
RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
{
@@ -1182,7 +1206,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
if (parent == old_child)
{
- /* Replace the root node with the new large node */
+ /* Replace the root node with the new larger node */
tree->ctl->root = new_child;
}
else
@@ -1192,8 +1216,8 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
}
/*
- * The radix tree doesn't sufficient height. Extend the radix tree so it can
- * store the key.
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
*/
static void
RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
@@ -1337,7 +1361,7 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stor
#undef RT_NODE_LEVEL_INNER
}
-/* Like, RT_NODE_INSERT_INNER, but for leaf nodes */
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
static bool
RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_VALUE_TYPE value)
@@ -1377,7 +1401,7 @@ RT_CREATE(MemoryContext ctx)
#else
tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
- /* Create the slab allocator for each size class */
+ /* Create a slab context for each size class */
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
@@ -1570,7 +1594,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
parent = RT_PTR_GET_LOCAL(tree, stored_child);
shift = parent->shift;
- /* Descend the tree until a leaf node */
+ /* Descend the tree until we reach a leaf node */
while (shift >= 0)
{
RT_PTR_ALLOC new_child;
--
2.39.0
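With the parameters and interface documented in the header comment above, using a generated tree looks roughly like the following (a sketch only, assuming RT_PREFIX is rt and a uint64 value type; the generated names follow the prefix, as the TIDStore patch below shows with local_rt_*/shared_rt_*):

    rt_radix_tree *tree;
    rt_iter    *iter;
    uint64      key = 0x1234;
    uint64      value = 42;

    tree = rt_create(CurrentMemoryContext);
    rt_set(tree, key, value);               /* insert or update one key */

    if (rt_search(tree, key, &value))       /* existence check plus fetch */
        elog(NOTICE, "found " UINT64_FORMAT, value);

    /* iterate over all pairs in ascending key order */
    iter = rt_begin_iterate(tree);
    while (rt_iterate_next(iter, &key, &value))
    {
        /* ... use key and value ... */
    }
    rt_end_iterate(iter);

    rt_free(tree);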
Attachment: v21-0021-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (text/x-patch)
From f3f586bc84026364d46e7bcf6eddd04a83264de4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v21 21/22] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 624 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 189 ++++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 963 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e3a783abd0..38bc3589ae 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2182,6 +2182,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..fa55793227
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,624 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a Tid is encoded as a pair of a 64-bit key and a 64-bit value,
+ * and stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * with tidstore_attach(). It supports concurrent updates, but only one process
+ * is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of a 64-bit
+ * key and a 64-bit value. First, we construct a 64-bit unsigned integer key
+ * that combines the block number and the offset number. The lowest 11 bits
+ * represent the offset number, and the next 32 bits are the block number.
+ * That is, only 43 bits are used (most significant bit on the left):
+ *
+ * uuuuuuuu uuuuuuuu uuuuuYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYXXX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys could help keep the radix tree shallow.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits, and
+ * the remaining 37 bits are used as the key:
+ *
+ * value = bitmap representation of the lowest 6 bits (XXXXXX)
+ * key   = uuuuuuuu uuuuuuuu uuuuuuuu uuuYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYXXXXX
+ *
+ * The maximum height of the radix tree is 5.
+ *
+ * XXX: if we want to support non-heap table AMs that want to use the full
+ * range of possible offset numbers, we'll need to reconsider the
+ * TIDSTORE_OFFSET_NBITS value.
+ */
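+
+/*
+ * As a worked example (using the defaults defined below, TIDSTORE_OFFSET_NBITS
+ * = 11 and TIDSTORE_VALUE_NBITS = 6): the TID (blkno = 1000, offset = 17)
+ * gives tid_i = 17 | (1000 << 11) = 2048017, so the bit offset within the
+ * value is 2048017 % 64 = 17 and the key is 2048017 / 64 = 32000. The TID
+ * (1000, 100) maps to key 32001 with bit offset 36. One heap block thus
+ * spreads over at most 2^(11 - 6) = 32 keys, and KEY_GET_BLKNO() recovers
+ * the block number from either key as key >> 5 = 1000.
+ */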
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6
+
+/*
+ * Memory consumption depends on the number of Tids stored, but also on their
+ * distribution, on how the radix tree stores them, and on the memory management
+ * that backs the radix tree. The maximum number of bytes that a TidStore can
+ * use is specified by max_bytes in tidstore_create(). We want the total
+ * amount of memory consumption not to exceed max_bytes.
+ *
+ * In non-shared cases, the radix tree uses a slab allocator for each node
+ * size class. The most memory-consuming case while adding Tids associated
+ * with one page (i.e., during tidstore_add_tids()) is allocating the
+ * largest radix tree node in a new slab block, which is approximately 70kB.
+ * Therefore, we deduct 70kB from the maximum bytes.
+ *
+ * In shared cases, DSA allocates memory segments big enough to follow
+ * a geometric series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases segment
+ * size, and the simulation showed that a 75% threshold for the maximum
+ * bytes works well when max_bytes is a power of two, and a 60% threshold
+ * works for other cases.
+ */
+#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6
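+
+/*
+ * For example, a max_bytes of 1GB (a power of two) gives a shared-memory
+ * limit of 768MB, and a max_bytes of 1.5GB gives a limit of 0.9GB.
+ */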
+
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+#define BLKNO_GET_KEY(blkno) \
+ (((uint64) (blkno) << (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ /*
+ * 'num_tids' is the number of Tids stored so far. 'max_bytes' is the maximum
+ * number of bytes a TidStore can use. These two fields are used in both
+ * the non-shared and shared cases.
+ */
+ uint64 num_tids;
+ uint64 max_bytes;
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+
+ /* protect the shared fields */
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0)
+ ? TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2
+ : TIDSTORE_SHARED_MAX_MEMORY_RATIO;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backends must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (TidStoreIsShared(ts))
+ {
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ {
+ /*
+ * Since the shared radix tree supports concurrent inserts,
+ * we don't need to acquire the lock.
+ */
+ shared_rt_set(ts->tree.shared, key, val);
+ }
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+#define NUM_KEYS_PER_BLOCK (1 << (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS))
+ ItemPointerData tid;
+ uint64 key_base;
+ uint64 values[NUM_KEYS_PER_BLOCK] = {0};
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+ key_base = BLKNO_GET_KEY(blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 key;
+ uint32 off;
+ int idx;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ /* encode the Tid to key and val */
+ key = tid_to_key_off(&tid, &off);
+
+ idx = key - key_base;
+ Assert(idx >= 0 && idx < NUM_KEYS_PER_BLOCK);
+
+ values[idx] |= UINT64CONST(1) << off;
+ }
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i < NUM_KEYS_PER_BLOCK; i++)
+ {
+ if (values[i])
+ {
+ uint64 key = key_base + i;
+
+ tidstore_insert_kv(ts, key, values[i]);
+ }
+ }
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+}
+
+/* Return true if the given Tid is present in TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = TidStoreIsShared(ts) ?
+ shared_rt_search(ts->tree.shared, key, &val) :
+ local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+ iter->result.blkno = InvalidBlockNumber;
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate over */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+uint64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+uint64
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return (uint64) sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return (uint64) sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a Tid into a key, and set *off to its bit offset within the value.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 196bece0a3..cbfe329591 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..ec3d9f87f5
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually not fully used */
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern uint64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern uint64 tidstore_max_memory(TidStore *ts);
+extern uint64 tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e4162db613..7b7663e2e1 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..5d38387450
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,189 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(void)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 11
+#define IS_POWER_OF_TWO(x) (((x) & ((x) - 1)) == 0)
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS] = {
+ 1 << 5, 1 << 6, 1 << 7, 1 << 8, 1 << 9,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4,
+ 1 << 10
+ };
+ OffsetNumber offs_sorted[TEST_TIDSTORE_NUM_OFFSETS] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4,
+ 1 << 5, 1 << 6, 1 << 7, 1 << 8, 1 << 9,
+ 1 << 10
+ };
+ int blk_idx;
+
+ elog(NOTICE, "testing basic operations");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, IS_POWER_OF_TWO(off));
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, IS_POWER_OF_TWO(off));
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs_sorted[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno,
+ offs_sorted[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+ test_basic();
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.0
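As a quick orientation for readers skimming the test module above, the TidStore API it exercises boils down to the pattern below. This is only an illustrative sketch (not part of any patch), using the function names and struct fields that appear in test_tidstore.c; the authoritative signatures live in access/tidstore.h.

    /* Sketch only: the TidStore usage pattern exercised by test_tidstore.c */
    TidStore   *ts = tidstore_create(2 * 1024 * 1024L, NULL);  /* 2MB cap, local memory */
    OffsetNumber offs[3] = {1, 2, 4};
    ItemPointerData tid;
    TidStoreIter *iter;
    TidStoreIterResult *res;

    /* store the offsets of dead tuples, grouped by block */
    tidstore_add_tids(ts, (BlockNumber) 0, offs, 3);

    /* existence check -- the operation vac_tid_reaped() needs */
    ItemPointerSet(&tid, 0, 2);
    if (tidstore_lookup_tid(ts, &tid))
        elog(NOTICE, "TID (0,2) is a dead tuple");

    /* iterate in block order, getting the sorted offsets for each block */
    iter = tidstore_begin_iterate(ts);
    while ((res = tidstore_iterate_next(iter)) != NULL)
        elog(NOTICE, "block %u has %d dead offsets", res->blkno, res->num_offsets);
    tidstore_end_iterate(iter);

    tidstore_destroy(ts);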
v21-0022-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
From 2c93e9cdb3b6825df9633bfe9e122b08d936780c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 20 Jan 2023 10:29:31 +0700
Subject: [PATCH v21 22/22] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which is not space efficient and is slow to look up. It also had
a 1GB limit on its size.
This commit switches to TIDStore for this purpose. Since TIDStore,
backed by the radix tree, allocates memory incrementally, the 1GB
limit goes away.
Also, since we can no longer estimate exactly how many TIDs fit in a
given amount of memory, the pg_stat_progress_vacuum columns
max_dead_tuples and num_dead_tuples are renamed and now report the
progress information in bytes.
Furthermore, since TIDStore uses the radix tree internally, the
minimum amount of memory it requires is 1MB, the initial DSA segment
size. Due to that, this change increases the minimum
maintenance_work_mem from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 210 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +-------
src/backend/commands/vacuumparallel.c | 64 ++++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
15 files changed, 138 insertions(+), 268 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 38bc3589ae..b96bca38db 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6860,10 +6860,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6871,10 +6871,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..90f8a5e087 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,17 +221,21 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected LP_DEAD items including existing LP_DEAD items */
+ int lpdead_items;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies !HAS_LPDEAD_ITEMS(), but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
bool all_frozen; /* provided all_visible is also true */
TransactionId visibility_cutoff_xid; /* For recovery conflicts */
} LVPagePruneState;
+#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)
/* Struct for saving and restoring vacuum error information. */
typedef struct LVSavedErrInfo
@@ -259,8 +264,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +831,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +912,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1018,7 +1023,7 @@ lazy_scan_heap(LVRelState *vacrel)
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || !HAS_LPDEAD_ITEMS(prunestate));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1039,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (HAS_LPDEAD_ITEMS(prunestate))
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.lpdead_items, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1081,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (HAS_LPDEAD_ITEMS(prunestate))
+ {
+ /* Save details of the LP_DEAD items from the page */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.lpdead_items);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1157,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if (HAS_LPDEAD_ITEMS(prunestate) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1205,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if (HAS_LPDEAD_ITEMS(prunestate) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1261,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1543,13 +1555,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1581,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1589,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->lpdead_items; prunestate->lpdead_items's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1602,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->lpdead_items = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1647,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->lpdead_items++] = offnum;
continue;
}
@@ -1875,7 +1884,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->lpdead_items == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1897,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1918,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->lpdead_items;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -2129,8 +2119,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2128,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2180,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2209,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2236,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2282,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2355,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2410,10 +2392,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2411,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2421,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2435,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2446,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,14 +2456,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2495,11 +2480,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2502,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2576,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3093,46 +3071,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3143,11 +3081,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3110,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3123,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..a526e607fe 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127e..358ad25996 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,18 +2342,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2365,60 +2352,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..4c0ce4b7e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index cbfe329591..4c35af3412 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -188,6 +188,8 @@ static const char *const BuiltinTrancheNames[] = {
"PgStatsHash",
/* LWTRANCHE_PGSTATS_DATA: */
"PgStatsData",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index cd0fc2cb8f..85e42269be 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2301,7 +2301,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..220d89fff7 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 7b7663e2e1..c9b4741e32 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -205,6 +205,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DSA,
LWTRANCHE_PGSTATS_HASH,
LWTRANCHE_PGSTATS_DATA,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e7a2f5856a..f6ae02eb14 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.39.0
Attached is a rebase to fix conflicts from recent commits.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v22-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From dc2ac74612299ad60e3da958314338a7c3ff1ad5 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v22 02/22] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 51484ca7e2..077f197a64 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.0
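As a side note, the behavior that pg_rightmost_one32/64 encapsulate (described in the comment the patch moves to pg_bitutils.h) is easy to see with a small worked example. This is only an illustration, not part of the patch:

    uint32  w = 0xb0;                       /* ...10110000 */
    uint32  bit = pg_rightmost_one32(w);    /* 0x10, i.e. ...00010000 */

    /* bms_first_member()-style loop: peel off one set bit per iteration */
    while (w != 0)
    {
        bit = pg_rightmost_one32(w);        /* isolate the rightmost one-bit */
        w &= ~bit;                          /* clear it and continue */
    }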
v22-0004-Clean-up-some-nomenclature-around-node-insertion.patch
From 3dab17562a62b9e5086bcf473cf1a81768f70552 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 19 Jan 2023 16:33:51 +0700
Subject: [PATCH v22 04/22] Clean up some nomenclature around node insertion
Replace node/nodep with hopefully more informative names.
In passing, remove some outdated asserts and move some
variable declarations to the scope where they're used.
---
src/include/lib/radixtree.h | 64 ++++++++++++++-----------
src/include/lib/radixtree_insert_impl.h | 22 +++++----
2 files changed, 47 insertions(+), 39 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 97cccdc9ca..a1458bc25f 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -645,9 +645,9 @@ typedef struct RT_ITER
} RT_ITER;
-static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child);
-static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, uint64 value);
/* verification (available only with assertion) */
@@ -1153,18 +1153,18 @@ RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
* Replace old_child with new_child, and free the old one.
*/
static void
-RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
RT_PTR_ALLOC new_child, uint64 key)
{
- RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
-
#ifdef USE_ASSERT_CHECKING
RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
- Assert(old->shift == new->shift);
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
#endif
- if (parent == old)
+ if (parent == old_child)
{
/* Replace the root node with the new large node */
tree->ctl->root = new_child;
@@ -1172,7 +1172,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child
else
RT_NODE_UPDATE_INNER(parent, key, new_child);
- RT_FREE_NODE(tree, old_child);
+ RT_FREE_NODE(tree, stored_old_child);
}
/*
@@ -1220,11 +1220,11 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
*/
static inline void
RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
- RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
int shift = node->shift;
- Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
while (shift >= RT_NODE_SPAN)
{
@@ -1237,15 +1237,15 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
newchild->shift = newshift;
- RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
parent = node;
node = newchild;
- nodep = allocchild;
+ stored_node = allocchild;
shift -= RT_NODE_SPAN;
}
- RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value);
tree->ctl->num_keys++;
}
@@ -1305,9 +1305,15 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
}
#endif
-/* Insert the child to the inner node */
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
static bool
-RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child)
{
#define RT_NODE_LEVEL_INNER
@@ -1315,9 +1321,9 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC node
#undef RT_NODE_LEVEL_INNER
}
-/* Insert the value to the leaf node */
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
static bool
-RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, uint64 value)
{
#define RT_NODE_LEVEL_LEAF
@@ -1525,8 +1531,8 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
int shift;
bool updated;
RT_PTR_LOCAL parent;
- RT_PTR_ALLOC nodep;
- RT_PTR_LOCAL node;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
@@ -1540,32 +1546,32 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
if (key > tree->ctl->max_val)
RT_EXTEND(tree, key);
- nodep = tree->ctl->root;
- parent = RT_PTR_GET_LOCAL(tree, nodep);
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
shift = parent->shift;
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- RT_PTR_ALLOC child;
+ RT_PTR_ALLOC new_child;
- node = RT_PTR_GET_LOCAL(tree, nodep);
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
- if (NODE_IS_LEAF(node))
+ if (NODE_IS_LEAF(child))
break;
- if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
{
- RT_SET_EXTEND(tree, key, value, parent, nodep, node);
+ RT_SET_EXTEND(tree, key, value, parent, stored_child, child);
return false;
}
- parent = node;
- nodep = child;
+ parent = child;
+ stored_child = new_child;
shift -= RT_NODE_SPAN;
}
- updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value);
/* Update the statistics */
if (!updated)
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index e4faf54d9d..1d0eb396e2 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -14,8 +14,6 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
bool chunk_exists = false;
- RT_PTR_LOCAL newnode = NULL;
- RT_PTR_ALLOC allocnode;
#ifdef RT_NODE_LEVEL_LEAF
const bool inner = false;
@@ -47,6 +45,8 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
RT_NODE32_TYPE *new32;
const uint8 new_kind = RT_NODE_KIND_32;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
@@ -65,8 +65,7 @@
RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
new32->base.chunks, new32->children);
#endif
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
}
else
@@ -121,6 +120,8 @@
n32->base.n.fanout == class32_min.fanout)
{
/* grow to the next size class of this kind */
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
@@ -132,8 +133,7 @@
#endif
newnode->fanout = class32_max.fanout;
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
/* also update pointer for this kind */
@@ -142,6 +142,8 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
RT_NODE125_TYPE *new125;
const uint8 new_kind = RT_NODE_KIND_125;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
@@ -169,8 +171,7 @@
Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
}
else
@@ -220,6 +221,8 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
RT_NODE256_TYPE *new256;
const uint8 new_kind = RT_NODE_KIND_256;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
@@ -243,8 +246,7 @@
cnt++;
}
- Assert(parent != NULL);
- RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
}
else
--
2.39.0
Attachment: v22-0001-introduce-vector8_min-and-vector8_highbit_mask.patch (text/x-patch, US-ASCII)
From 2b4d8c3a7a538c029faaa14ef5f22beec10406bc Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v22 01/22] introduce vector8_min and vector8_highbit_mask
TODO: commit message
TODO: Remove uint64 case.
separate-commit TODO: move non-SIMD fallbacks to own header
to clean up the #ifdef maze.
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..84d41a340a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return a bitmask formed from the high bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.0
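To make the intended use of the two new helpers concrete, here is a minimal standalone sketch (not part of the patches) of the search pattern they enable: broadcast the search byte, compare it against a block of chunk bytes, and turn the comparison result into a bitmask whose rightmost set bit gives the match index. It is written with raw SSE2 intrinsics rather than the simd.h wrappers, and the function and variable names are made up for illustration; the node-32 search in the radixtree patch later in this series does the same thing portably via vector8_eq() and vector8_highbit_mask().

#include <emmintrin.h>		/* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>
#include <strings.h>		/* ffs() */

/* Find the index of the first of 'count' bytes in 'chunks' equal to 'key'. */
static int
chunk_search_eq_16(const uint8_t chunks[16], int count, uint8_t key)
{
	__m128i		spread = _mm_set1_epi8((char) key);
	__m128i		haystack = _mm_loadu_si128((const __m128i *) chunks);
	__m128i		cmp = _mm_cmpeq_epi8(spread, haystack);
	uint32_t	bitfield = (uint32_t) _mm_movemask_epi8(cmp);

	/* mask out lanes beyond the number of valid entries */
	bitfield &= (1u << count) - 1;

	return bitfield ? ffs((int) bitfield) - 1 : -1;
}

int
main(void)
{
	uint8_t		chunks[16] = {3, 7, 9, 42, 100};

	printf("index of 42: %d\n", chunk_search_eq_16(chunks, 5, 42));	/* 3 */
	printf("index of 8: %d\n", chunk_search_eq_16(chunks, 5, 8));	/* -1 */
	return 0;
}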
Attachment: v22-0005-Restore-RT_GROW_NODE_KIND.patch (text/x-patch, US-ASCII)
From 413cce02ce2419d4760f411a77f24213958ea906 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 11:32:24 +0700
Subject: [PATCH v22 05/22] Restore RT_GROW_NODE_KIND
(This was previously "exploded" out during the work to
switch this to a template)
Change the API so that we pass it the allocated pointer
and return the local pointer. That way, there is consistency
in growing nodes whether we change kind or not.
Also rename to RT_SWITCH_NODE_KIND, since it should work just as
well for shrinking nodes.
---
src/include/lib/radixtree.h | 104 +++---------------------
src/include/lib/radixtree_insert_impl.h | 24 ++----
2 files changed, 19 insertions(+), 109 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index a1458bc25f..c08016de3a 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -127,10 +127,9 @@
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
-#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
-//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
@@ -1080,26 +1079,22 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
newnode->shift = oldnode->shift;
newnode->count = oldnode->count;
}
-#if 0
+
/*
- * Create a new node with 'new_kind' and the same shift, chunk, and
- * count of 'node'.
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
*/
-static RT_NODE*
-RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool inner)
{
- RT_PTR_ALLOC allocnode;
- RT_PTR_LOCAL newnode;
- bool inner = !NODE_IS_LEAF(node);
-
- allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
RT_COPY_NODE(newnode, node);
return newnode;
}
-#endif
+
/* Free the given node */
static void
RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
@@ -1415,78 +1410,6 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
return tree->ctl->handle;
}
-
-/*
- * Recursively free all nodes allocated to the DSA area.
- */
-static inline void
-RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
-{
- RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
-
- check_stack_depth();
- CHECK_FOR_INTERRUPTS();
-
- /* The leaf node doesn't have child pointers */
- if (NODE_IS_LEAF(node))
- {
- dsa_free(tree->dsa, ptr);
- return;
- }
-
- switch (node->kind)
- {
- case RT_NODE_KIND_4:
- {
- RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
-
- for (int i = 0; i < n4->base.n.count; i++)
- RT_FREE_RECURSE(tree, n4->children[i]);
-
- break;
- }
- case RT_NODE_KIND_32:
- {
- RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
-
- for (int i = 0; i < n32->base.n.count; i++)
- RT_FREE_RECURSE(tree, n32->children[i]);
-
- break;
- }
- case RT_NODE_KIND_125:
- {
- RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
-
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
- continue;
-
- RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
- }
-
- break;
- }
- case RT_NODE_KIND_256:
- {
- RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
-
- for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
- {
- if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
- continue;
-
- RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
- }
-
- break;
- }
- }
-
- /* Free the inner node */
- dsa_free(tree->dsa, ptr);
-}
#endif
/*
@@ -1498,10 +1421,6 @@ RT_FREE(RT_RADIX_TREE *tree)
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
- /* Free all memory used for radix tree nodes */
- if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
- RT_FREE_RECURSE(tree, tree->ctl->root);
-
/*
* Vandalize the control block to help catch programming error where
* other backends access the memory formerly occupied by this radix tree.
@@ -2280,10 +2199,9 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
#undef RT_FREE_NODE
-#undef RT_FREE_RECURSE
#undef RT_EXTEND
#undef RT_SET_EXTEND
-#undef RT_GROW_NODE_KIND
+#undef RT_SWITCH_NODE_KIND
#undef RT_COPY_NODE
#undef RT_REPLACE_NODE
#undef RT_PTR_GET_LOCAL
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 1d0eb396e2..e3e44669ea 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -53,11 +53,9 @@
/* grow node from 4 to 32 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
- RT_COPY_NODE(newnode, node);
- //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new32 = (RT_NODE32_TYPE *) newnode;
+
#ifdef RT_NODE_LEVEL_LEAF
RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
new32->base.chunks, new32->values);
@@ -119,13 +117,15 @@
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
n32->base.n.fanout == class32_min.fanout)
{
- /* grow to the next size class of this kind */
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+ /* grow to the next size class of this kind */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
#ifdef RT_NODE_LEVEL_LEAF
memcpy(newnode, node, class32_min.leaf_size);
#else
@@ -135,9 +135,6 @@
RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
node = newnode;
-
- /* also update pointer for this kind */
- n32 = (RT_NODE32_TYPE *) newnode;
}
if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
@@ -152,10 +149,7 @@
/* grow node from 32 to 125 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
- RT_COPY_NODE(newnode, node);
- //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new125 = (RT_NODE125_TYPE *) newnode;
for (int i = 0; i < class32_max.fanout; i++)
@@ -229,11 +223,9 @@
/* grow node from 125 to 256 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
- RT_COPY_NODE(newnode, node);
- //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new256 = (RT_NODE256_TYPE *) newnode;
+
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
{
if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
--
2.39.0
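After this patch, every kind-changing grow path in radixtree_insert_impl.h has the same shape. The following sketch is assembled from the hunks above (the copying of the kind-specific chunks and children/values arrays in between is elided):

	/* grow node, e.g. from 4 to 32; 32 -> 125 and 125 -> 256 look the same */
	allocnode = RT_ALLOC_NODE(tree, new_class, inner);
	newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
	new32 = (RT_NODE32_TYPE *) newnode;

	/* ... copy the old node's kind-specific arrays into the new node ... */

	RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
	node = newnode;

Growing within a kind (node 15 to node 32) differs only in that it memcpy's the whole old node instead of calling RT_SWITCH_NODE_KIND.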
Attachment: v22-0003-Add-radixtree-template.patch (text/x-patch, US-ASCII)
From 7af7400b44e61957b38d4c974cdd4606c32f6b0f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v22 03/22] Add radixtree template
The only things configurable in this commit are function scope,
prefix, and local/shared memory.
The key and value type are still hard-coded to uint64.
(A later commit in v21 will make value type configurable)
It might be good at some point to offer a different tree type,
e.g. "single-value leaves" to allow for variable length keys
and values, giving full flexibility to developers.
TODO: Much broader commit message
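As a rough sketch of how a caller instantiates the template, based on the interface documented in radixtree.h below (the 'rt' prefix and the calling code here are made up for illustration; a local-memory tree is used, so RT_SHMEM is left undefined):

#include "postgres.h"

#define RT_PREFIX rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
#include "lib/radixtree.h"

static void
radixtree_example(void)
{
	rt_radix_tree *tree;
	uint64		value;

	/* the local-memory variant takes only a memory context */
	tree = rt_create(CurrentMemoryContext);

	rt_set(tree, UINT64CONST(1234), UINT64CONST(5678));

	if (rt_search(tree, UINT64CONST(1234), &value))
		elog(NOTICE, "found value " UINT64_FORMAT, value);

	rt_delete(tree, UINT64CONST(1234));
	rt_free(tree);
}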
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2321 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 106 +
src/include/lib/radixtree_insert_impl.h | 316 +++
src/include/lib/radixtree_iter_impl.h | 138 +
src/include/lib/radixtree_search_impl.h | 131 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 653 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 3816 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 604b702a91..50f0aae3ab 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..97cccdc9ca
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2321 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * There are some differences from the proposed implementation. For instance,
+ * there is no support for path compression or lazy path expansion. The radix
+ * tree supports only fixed-length keys, so we don't expect the tree to become
+ * very high.
+ *
+ * Both the key and the value are 64-bit unsigned integers. The inner nodes and
+ * the leaf nodes have slightly different structures: inner nodes (shift > 0)
+ * store pointers to their child nodes as values, while leaf nodes (shift == 0)
+ * store the 64-bit unsigned integer specified by the user as the value. The
+ * paper refers to this technique as "Multi-value leaves". We chose it to avoid
+ * an additional pointer traversal. It is the reason this code currently does
+ * not support variable-length keys.
+ *
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, so there is some code duplication. While this sometimes makes code
+ * maintenance tricky, it reduces branch prediction misses when deciding
+ * whether a node is an inner node or a leaf node.
+ *
+ * XXX: radix tree nodes are never shrunk.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ *
+ * Optional parameters:
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITER - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * RT_CREATE() creates an empty radix tree in the given memory context, along
+ * with child memory contexts for each kind of radix tree node.
+ *
+ * RT_ITERATE_NEXT() guarantees that key-value pairs are returned in
+ * ascending order of the key.
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined only if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+//#define RT_GROW_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
+#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
+#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* macros and types common to all implementations */
+#ifndef RT_COMMON
+#define RT_COMMON
+
+#ifdef RT_DEBUG
+#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+#endif
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/* Invalid index used in node-125 */
+#define RT_NODE_125_INVALID_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Supported radix tree node kinds and size classes.
+ *
+ * There are 4 node kinds and each node kind has one or two size classes,
+ * partial and full. The size classes within the same node kind share the same
+ * node structure but have a different fanout, which is stored in the 'fanout'
+ * field of RT_NODE. For example in size class 15, when a 16th element
+ * is to be inserted, we allocate a larger area and memcpy the entire old
+ * node to it.
+ *
+ * This technique allows us to limit the node kinds to 4, which limits the
+ * number of cases in switch statements. It also allows a possible future
+ * optimization to encode the node kind in a pointer tag.
+ *
+ * These size classes have been chosen carefully so that they minimize the
+ * allocator padding in both the inner and leaf nodes on DSA.
+ *
+ */
+#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+#endif /* RT_COMMON */
+
+
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_4_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
+#define VAR_NODE_HAS_FREE_SLOT(node) \
+ ((node)->base.n.count < (node)->base.n.fanout)
+#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
+ ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
+
+/* Base type of each node kind for leaf and inner nodes */
+/* The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds. */
+typedef struct RT_NODE_BASE_4
+{
+ RT_NODE n;
+
+ /* 4 children, for key chunks */
+ uint8 chunks[4];
+} RT_NODE_BASE_4;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
+ * 256, to store indexes into a second array that contains up to 125 values (or
+ * child pointers in inner nodes).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(128)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate for two main reasons:
+ *
+ * 1) the value type might be different than something fitting into a pointer
+ * width type
+ * 2) Need to represent non-existing values in a key-type independent way.
+ *
+ * 1) is clearly worth being concerned about, but it's not clear 2) is as
+ * good. It might be better to just indicate non-existing entries the same way
+ * in inner nodes.
+ */
+typedef struct RT_NODE_INNER_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_4;
+
+typedef struct RT_NODE_LEAF_4
+{
+ RT_NODE_BASE_4 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_4;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ uint64 values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * for directly storing values (or child pointers in inner nodes).
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ uint64 values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+
+ /* slab block size */
+ Size inner_blocksize;
+ Size leaf_blocksize;
+} RT_SIZE_CLASS_ELEM;
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define NODE_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_4_FULL] = {
+ .name = "radix tree node 4",
+ .fanout = 4,
+ .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_PARTIAL] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
+ },
+ [RT_CLASS_32_FULL] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
+ },
+ [RT_CLASS_125_FULL] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+/* Map from the node kind to its minimum size class */
+static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
+ [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
+ [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
+ [RT_NODE_KIND_256] = RT_CLASS_256,
+};
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* A radix tree with nodes */
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* A radix tree with nodes */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During the iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning the
+ * iteration while one process is doing so, or to allow multiple processes to iterate.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is being constructed during the iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return the index of the first element in 'node' that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index of the first element in 'node' that equals 'chunk'. Return -1
+ * if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk to insert into chunks in the given node.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ if (node->chunks[index] >= chunk)
+ break;
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-4 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
+ uint8 *dst_chunks, uint64 *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(uint64) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline uint64
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline uint64
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the shift needed to store the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ return (key == 0)
+ ? 0
+ : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value stored in a node with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (inner)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (inner)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+{
+ if (inner)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool inner = shift > 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+#if 0
+/*
+ * Create a new node with 'new_kind' and the same shift, chunk, and
+ * count of 'node'.
+ */
+static RT_NODE*
+RT_GROW_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_LOCAL node, uint8 new_kind)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ bool inner = !NODE_IS_LEAF(node);
+
+ allocnode = RT_ALLOC_NODE(tree, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, RT_KIND_MIN_SIZE_CLASS[new_kind], inner);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+#endif
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+ RT_PTR_LOCAL old = RT_PTR_GET_LOCAL(tree, old_child);
+
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old->shift == new->shift);
+#endif
+
+ if (parent == old)
+ {
+ /* Replace the root node with the new large node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so it can
+ * store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_4 *n4;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ node->shift = shift;
+ node->count = 1;
+
+ n4 = (RT_NODE_INNER_4 *) node;
+ n4->base.chunks[0] = 0;
+ n4->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have the inner and leaf nodes for the given key-value
+ * pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC nodep, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, nodep) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool inner = newshift > 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, nodep, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ nodep = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * to the value is set to value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the child pointer and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the value and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/* Insert the child to the inner node */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Insert the value to the leaf node */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC nodep, RT_PTR_LOCAL node,
+ uint64 key, uint64 value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create the slab allocator for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ for (int i = 0; i < n4->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true; otherwise insert a new entry and return false.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC nodep;
+ RT_PTR_LOCAL node;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ nodep = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, nodep);
+ shift = parent->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ node = RT_PTR_GET_LOCAL(tree, nodep);
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_SET_EXTEND(tree, key, value, parent, nodep, node);
+ return false;
+ }
+
+ parent = node;
+ nodep = child;
+ shift -= RT_NODE_SPAN;
+ }
+
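+	/* We've arrived at the leaf node; insert or update the value there */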
+ updated = RT_NODE_INSERT_LEAF(tree, parent, nodep, node, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. If the key is found, set its
+ * value to *value_p (which therefore must not be NULL) and return true;
+ * otherwise return false.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ if (NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+		/* the key was not found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (!NODE_IS_EMPTY(node))
+ return true;
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+	/* Delete the key from the inner nodes, walking up the saved stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (!NODE_IS_EMPTY(node))
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ return true;
+}
+#endif
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance to the next slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance to the next slot in the leaf node. On success, return true and set
+ * the value to *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ uint64 *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
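+
+	/*
+	 * The iterator's node stack is indexed by level: stack[0] will hold the
+	 * leaf node and stack[stack_len] the root.
+	 */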
+
+ /*
+	 * Descend from the root to the leftmost leaf node. The key is constructed
+	 * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * If there is a next key, set *key_p and *value_p and return true; otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ uint64 value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner node
+		 * iterators, starting at level 1, until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+	/* XXX: is this necessary? */
+ Size total = sizeof(RT_RADIX_TREE);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
+
+ for (int i = 1; i < n4->n.count; i++)
+ Assert(n4->chunks[i - 1] < n4->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+				/* Check if the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->ctl->num_keys,
+ tree->ctl->root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_32_PARTIAL],
+ tree->ctl->cnt[RT_CLASS_32_FULL],
+ tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_256])));
+}
+
+static void
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n4->base.chunks[i], n4->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n4->base.chunks[i]);
+
+ if (recurse)
+ RT_DUMP_NODE(n4->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], n32->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ RT_DUMP_NODE(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(128); i++)
+ {
+ fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ }
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ RT_DUMP_NODE(RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ RT_DUMP_NODE(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ tree->ctl->max_val, tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_LOCAL child;
+
+ RT_DUMP_NODE(node, level, false);
+
+ if (NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We've reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_size,
+ RT_SIZE_CLASS_INFO[i].inner_blocksize,
+ RT_SIZE_CLASS_INFO[i].leaf_size,
+ RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ RT_DUMP_NODE(tree->ctl->root, 0, true);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+
+/* locally declared macros */
+#undef NODE_IS_LEAF
+#undef NODE_IS_EMPTY
+#undef VAR_NODE_HAS_FREE_SLOT
+#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_RADIX_TREE_MAGIC
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_4_FULL
+#undef RT_CLASS_32_PARTIAL
+#undef RT_CLASS_32_FULL
+#undef RT_CLASS_125_FULL
+#undef RT_CLASS_256
+#undef RT_KIND_MIN_SIZE_CLASS
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_GROW_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..eb87866b90
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,106 @@
+/* TODO: shrink nodes */
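+
+/*
+ * Template fragment for RT_NODE_DELETE_INNER and RT_NODE_DELETE_LEAF.  It is
+ * #include'd with either RT_NODE_LEVEL_INNER or RT_NODE_LEVEL_LEAF defined,
+ * which selects the concrete node types used below.
+ */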
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(NODE_IS_LEAF(node));
+#else
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
+ n4->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
+ n4->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..e4faf54d9d
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,316 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
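+/*
+ * Template fragment for RT_NODE_INSERT_INNER and RT_NODE_INSERT_LEAF: insert
+ * 'child' (inner) or 'value' (leaf) for the chunk of 'key' at this node's
+ * level, growing the node into a larger kind when it is full.  Returns true
+ * if the chunk already existed, i.e. this was an update.
+ */
+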
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+ RT_PTR_LOCAL newnode = NULL;
+ RT_PTR_ALLOC allocnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool inner = false;
+ Assert(NODE_IS_LEAF(node));
+#else
+ const bool inner = true;
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[idx] = value;
+#else
+ n4->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ {
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 4 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_32);
+ new32 = (RT_NODE32_TYPE *) newnode;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
+ new32->base.chunks, new32->children);
+#endif
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
+ int count = n4->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
+ count, insertpos);
+#endif
+ }
+
+ n4->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n4->values[insertpos] = value;
+#else
+ n4->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
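+			/*
+			 * If we get here, the node was grown into a node of the next
+			 * kind above, so fall through and insert the chunk into the new,
+			 * larger node.  All other paths broke out of the switch already.
+			 */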
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ n32->base.n.fanout == class32_min.fanout)
+ {
+ /* grow to the next size class of this kind */
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+
+ /* also update pointer for this kind */
+ n32 = (RT_NODE32_TYPE *) newnode;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ {
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_125);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_NODE_125_INVALID_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ {
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_COPY_NODE(newnode, node);
+ //newnode = RT_GROW_NODE_KIND(tree, node, RT_NODE_KIND_256);
+ new256 = (RT_NODE256_TYPE *) newnode;
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ Assert(parent != NULL);
+ RT_REPLACE_NODE(tree, parent, nodep, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(128); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+	 * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..0b8b68df6c
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,138 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
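+/*
+ * Advance node_iter->current_idx to the next used slot in this node.  For
+ * inner nodes the corresponding child is returned (NULL when the node is
+ * exhausted); for leaf nodes the value is stored in *value_p and true/false
+ * is returned instead.
+ */
+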
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value;
+
+ Assert(NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n4->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
+#endif
+ key_chunk = n4->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..31e4978e4f
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,131 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
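+/*
+ * This fragment serves both the search and the child-update paths: when
+ * RT_ACTION_UPDATE is defined, the child pointer for the chunk is replaced
+ * with 'new_child' and nothing is returned; otherwise the found value (leaf)
+ * or child (inner) is returned through the output parameter.
+ */
+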
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ uint64 value = 0;
+
+ Assert(NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+#endif
+ Assert(!NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n4->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n4->values[idx];
+#else
+ child = n4->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_NODE_125_INVALID_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_NODE_125_INVALID_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ *value_p = value;
+#else
+ Assert(child_p != NULL);
+ *child_p = child;
+#endif
+
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE4_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 104386e674..c67f936880 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+    '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..d8323f587f
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,653 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
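+/*
+ * Fanout of each node kind.  The leading 0 acts as a lower-bound sentinel
+ * for the range checks in test_node_types_insert().
+ */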
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
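+/*
+ * RT_PREFIX is prepended to the exposed names (rt_create, rt_set, rt_search,
+ * rt_iterate_next, ...), RT_SCOPE sets their scope, RT_DECLARE and RT_DEFINE
+ * request the declarations and the definitions respectively, and
+ * RT_USE_DELETE makes rt_delete available.
+ */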
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ uint64 dummy;
+ uint64 key;
+ uint64 val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+		elog(ERROR, "rt_num_entries on empty tree returned non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+		elog(ERROR, "rt_iterate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+	/* prepare keys in interleaved order like 1, children, 2, children - 1, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], keys[i]))
+			elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ uint64 value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != keys[i])
+			elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ value, keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], keys[i]))
+			elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ uint64 val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+			 num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test inserting and deleting key-value pairs for each node type at the given
+ * shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ uint64 val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ uint64 v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT " after " UINT64_FORMAT " deletions",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.0
Attachment: v22-0007-Make-value-type-configurable.patch (text/x-patch)
From b0cc522b623c126c97b65376b1e7a071cb69f1c6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 14:19:15 +0700
Subject: [PATCH v22 07/22] Make value type configurable
Tests pass with uint32, although the test module builds
with warnings.
---
src/include/lib/radixtree.h | 79 ++++++++++---------
src/include/lib/radixtree_delete_impl.h | 4 +-
src/include/lib/radixtree_iter_impl.h | 2 +-
src/include/lib/radixtree_search_impl.h | 2 +-
.../modules/test_radixtree/test_radixtree.c | 41 ++++++----
5 files changed, 69 insertions(+), 59 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 98e4597eac..0a39bd6664 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -44,6 +44,7 @@
* declarations reside
* - RT_SHMEM - if defined, the radix tree is created in the DSA area
* so that multiple processes can access it simultaneously.
+ * - RT_VALUE_TYPE - the type of the value.
*
* Optional parameters:
* - RT_DEBUG - if defined add stats tracking and debugging functions
@@ -222,14 +223,14 @@ RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
#endif
RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
-RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *val_p);
-RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 val);
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE val);
#ifdef RT_USE_DELETE
RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
#endif
RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
-RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
@@ -435,7 +436,7 @@ typedef struct RT_NODE_LEAF_4
RT_NODE_BASE_4 base;
/* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_LEAF_4;
typedef struct RT_NODE_INNER_32
@@ -451,7 +452,7 @@ typedef struct RT_NODE_LEAF_32
RT_NODE_BASE_32 base;
/* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_LEAF_32;
typedef struct RT_NODE_INNER_125
@@ -467,7 +468,7 @@ typedef struct RT_NODE_LEAF_125
RT_NODE_BASE_125 base;
/* number of values depends on size class */
- uint64 values[FLEXIBLE_ARRAY_MEMBER];
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
} RT_NODE_LEAF_125;
/*
@@ -490,7 +491,7 @@ typedef struct RT_NODE_LEAF_256
bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
/* Slots for 256 values */
- uint64 values[RT_NODE_MAX_SLOTS];
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
} RT_NODE_LEAF_256;
/* Information for each size class */
@@ -520,33 +521,33 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
.name = "radix tree node 4",
.fanout = 4,
.inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
.inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(uint64)),
+ .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
@@ -648,7 +649,7 @@ typedef struct RT_ITER
static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child);
static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
- uint64 key, uint64 value);
+ uint64 key, RT_VALUE_TYPE value);
/* verification (available only with assertion) */
static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
@@ -828,10 +829,10 @@ RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count,
}
static inline void
-RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, uint64 *values, int count, int idx)
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
{
memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
- memmove(&(values[idx + 1]), &(values[idx]), sizeof(uint64) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
}
/* Delete the element at 'idx' */
@@ -843,10 +844,10 @@ RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count,
}
static inline void
-RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, uint64 *values, int count, int idx)
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
{
memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
- memmove(&(values[idx]), &(values[idx + 1]), sizeof(uint64) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
}
/* Copy both chunks and children/values arrays */
@@ -863,12 +864,12 @@ RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
}
static inline void
-RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, uint64 *src_values,
- uint8 *dst_chunks, uint64 *dst_values)
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
- const Size values_size = sizeof(uint64) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
memcpy(dst_chunks, src_chunks, chunk_size);
memcpy(dst_values, src_values, values_size);
@@ -890,7 +891,7 @@ RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
return node->children[node->base.slot_idxs[chunk]];
}
-static inline uint64
+static inline RT_VALUE_TYPE
RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
@@ -926,7 +927,7 @@ RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
return node->children[chunk];
}
-static inline uint64
+static inline RT_VALUE_TYPE
RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
@@ -944,7 +945,7 @@ RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
/* Set the value in the node-256 */
static inline void
-RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, uint64 value)
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
{
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
@@ -1215,7 +1216,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static inline void
-RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, uint64 value, RT_PTR_LOCAL parent,
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
int shift = node->shift;
@@ -1266,7 +1267,7 @@ RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
* to the value is set to value_p.
*/
static inline bool
-RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, uint64 *value_p)
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_search_impl.h"
@@ -1320,7 +1321,7 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stor
/* Like, RT_NODE_INSERT_INNER, but for leaf nodes */
static bool
RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
- uint64 key, uint64 value)
+ uint64 key, RT_VALUE_TYPE value)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_insert_impl.h"
@@ -1522,7 +1523,7 @@ RT_FREE(RT_RADIX_TREE *tree)
* and return true. Returns false if entry doesn't yet exist.
*/
RT_SCOPE bool
-RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
{
int shift;
bool updated;
@@ -1582,7 +1583,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, uint64 value)
* not be NULL.
*/
RT_SCOPE bool
-RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, uint64 *value_p)
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
{
RT_PTR_LOCAL node;
int shift;
@@ -1730,7 +1731,7 @@ RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
*/
static inline bool
RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
- uint64 *value_p)
+ RT_VALUE_TYPE *value_p)
{
#define RT_NODE_LEVEL_LEAF
#include "lib/radixtree_iter_impl.h"
@@ -1803,7 +1804,7 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
* return false.
*/
RT_SCOPE bool
-RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
/* Empty tree */
if (!iter->tree->ctl->root)
@@ -1812,7 +1813,7 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, uint64 *value_p)
for (;;)
{
RT_PTR_LOCAL child = NULL;
- uint64 value;
+ RT_VALUE_TYPE value;
int level;
bool found;
@@ -1971,6 +1972,7 @@ RT_STATS(RT_RADIX_TREE *tree)
tree->ctl->cnt[RT_CLASS_256])));
}
+/* XXX For display, assumes value type is numeric */
static void
RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
@@ -1998,7 +2000,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n4->base.chunks[i], n4->values[i]);
+ space, n4->base.chunks[i], (uint64) n4->values[i]);
}
else
{
@@ -2024,7 +2026,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n32->base.chunks[i], n32->values[i]);
+ space, n32->base.chunks[i], (uint64) n32->values[i]);
}
else
{
@@ -2077,7 +2079,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ space, i, (uint64) RT_NODE_LEAF_125_GET_VALUE(n125, i));
}
else
{
@@ -2107,7 +2109,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
continue;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, i, RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ space, i, (uint64) RT_NODE_LEAF_256_GET_VALUE(n256, i));
}
else
{
@@ -2213,6 +2215,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_SCOPE
#undef RT_DECLARE
#undef RT_DEFINE
+#undef RT_VALUE_TYPE
/* locally declared macros */
#undef NODE_IS_LEAF
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index eb87866b90..2612730481 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -33,7 +33,7 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, (uint64 *) n4->values,
+ RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, n4->values,
n4->base.n.count, idx);
#else
RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
@@ -50,7 +50,7 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, (uint64 *) n32->values,
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
n32->base.n.count, idx);
#else
RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 0b8b68df6c..5c06f8b414 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -16,7 +16,7 @@
uint8 key_chunk;
#ifdef RT_NODE_LEVEL_LEAF
- uint64 value;
+ RT_VALUE_TYPE value;
Assert(NODE_IS_LEAF(node_iter->node));
#else
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 31e4978e4f..365abaa46d 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -15,7 +15,7 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- uint64 value = 0;
+ RT_VALUE_TYPE value = 0;
Assert(NODE_IS_LEAF(node));
#else
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index d8323f587f..64d46dfe9a 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -24,6 +24,12 @@
#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
/*
* If you enable this, the "pattern" tests will print information about
* how long populating, probing, and iterating the test set takes, and
@@ -105,6 +111,7 @@ static const test_spec test_specs[] = {
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
#include "lib/radixtree.h"
@@ -128,9 +135,9 @@ test_empty(void)
{
rt_radix_tree *radixtree;
rt_iter *iter;
- uint64 dummy;
+ TestValueType dummy;
uint64 key;
- uint64 val;
+ TestValueType val;
#ifdef RT_SHMEM
int tranche_id = LWLockNewTrancheId();
@@ -202,26 +209,26 @@ test_basic(int children, bool test_inner)
/* insert keys */
for (int i = 0; i < children; i++)
{
- if (rt_set(radixtree, keys[i], keys[i]))
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
}
/* look up keys */
for (int i = 0; i < children; i++)
{
- uint64 value;
+ TestValueType value;
if (!rt_search(radixtree, keys[i], &value))
elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
- if (value != keys[i])
+ if (value != (TestValueType) keys[i])
elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
- value, keys[i]);
+ value, (TestValueType) keys[i]);
}
/* update keys */
for (int i = 0; i < children; i++)
{
- if (!rt_set(radixtree, keys[i], keys[i] + 1))
+ if (!rt_set(radixtree, keys[i], (TestValueType) (keys[i] + 1)))
elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
}
@@ -230,7 +237,7 @@ test_basic(int children, bool test_inner)
{
if (!rt_delete(radixtree, keys[i]))
elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
- if (rt_set(radixtree, keys[i], keys[i]))
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
}
@@ -248,12 +255,12 @@ check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
for (int i = start; i < end; i++)
{
uint64 key = ((uint64) i << shift);
- uint64 val;
+ TestValueType val;
if (!rt_search(radixtree, key, &val))
elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
key, end);
- if (val != key)
+ if (val != (TestValueType) key)
elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
key, val, key);
}
@@ -274,7 +281,7 @@ test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
uint64 key = ((uint64) i << shift);
bool found;
- found = rt_set(radixtree, key, key);
+ found = rt_set(radixtree, key, (TestValueType) key);
if (found)
elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
@@ -440,7 +447,7 @@ test_pattern(const test_spec * spec)
x = last_int + pattern_values[i];
- found = rt_set(radixtree, x, x);
+ found = rt_set(radixtree, x, (TestValueType) x);
if (found)
elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
@@ -495,7 +502,7 @@ test_pattern(const test_spec * spec)
bool found;
bool expected;
uint64 x;
- uint64 v;
+ TestValueType v;
/*
* Pick next value to probe at random. We limit the probes to the
@@ -526,7 +533,7 @@ test_pattern(const test_spec * spec)
if (found != expected)
elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
- if (found && (v != x))
+ if (found && (v != (TestValueType) x))
elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
v, x);
}
@@ -549,7 +556,7 @@ test_pattern(const test_spec * spec)
{
uint64 expected = last_int + pattern_values[i];
uint64 x;
- uint64 val;
+ TestValueType val;
if (!rt_iterate_next(iter, &x, &val))
break;
@@ -558,7 +565,7 @@ test_pattern(const test_spec * spec)
elog(ERROR,
"iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
x, expected, i);
- if (val != expected)
+ if (val != (TestValueType) expected)
elog(ERROR,
"iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
n++;
@@ -588,7 +595,7 @@ test_pattern(const test_spec * spec)
{
bool found;
uint64 x;
- uint64 v;
+ TestValueType v;
/*
* Pick next value to probe at random. We limit the probes to the
--
2.39.0
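
As a usage illustration for the new RT_VALUE_TYPE parameter: a caller defines it together with the other template switches before including radixtree.h, and the generated leaf nodes and function signatures then carry that type. The following is only a minimal sketch modelled on the test module changes above; the module's name-prefix and scope defines sit earlier in that file and are omitted here, so the rt_* names and the helper function are assumptions for illustration, not part of the patch.

#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
#define RT_VALUE_TYPE uint32	/* new template parameter; previously fixed to uint64 */
#include "lib/radixtree.h"

static void
rt_value_type_example(MemoryContext ctx)
{
	rt_radix_tree *tree = rt_create(ctx);	/* local (non-RT_SHMEM) variant */
	uint32		val;

	/* rt_set() reports whether the key was already present */
	if (rt_set(tree, UINT64CONST(42), (uint32) 7))
		elog(ERROR, "key 42 unexpectedly present");

	if (!rt_search(tree, UINT64CONST(42), &val) || val != 7)
		elog(ERROR, "lookup of key 42 failed");

	if (!rt_delete(tree, UINT64CONST(42)))
		elog(ERROR, "could not delete key 42");

	rt_free(tree);
}

As the commit message notes, a 4-byte value type still trips format-string warnings in the test module, because its elog() calls keep using the 64-bit hex format.
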
Attachment: v22-0006-Free-all-radix-tree-nodes-recursively.patch (text/x-patch)
From fe4ed7bf8033453b1ba38b6d298aa519fbe5b9f8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 20 Jan 2023 12:38:54 +0700
Subject: [PATCH v22 06/22] Free all radix tree nodes recursively
TODO: Consider adding more general functionality to DSA
to free all segments.
---
src/include/lib/radixtree.h | 78 +++++++++++++++++++++++++++++++++++++
1 file changed, 78 insertions(+)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index c08016de3a..98e4597eac 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -127,6 +127,7 @@
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
@@ -1410,6 +1411,78 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
return tree->ctl->handle;
}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_4:
+ {
+ RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+
+ for (int i = 0; i < n4->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n4->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
#endif
/*
@@ -1421,6 +1494,10 @@ RT_FREE(RT_RADIX_TREE *tree)
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
/*
* Vandalize the control block to help catch programming error where
* other backends access the memory formerly occupied by this radix tree.
@@ -2199,6 +2276,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
#undef RT_EXTEND
#undef RT_SET_EXTEND
#undef RT_SWITCH_NODE_KIND
--
2.39.0
Attachment: v22-0008-Streamline-calculation-of-slab-blocksize.patch (text/x-patch)
From 26d69b070472d5e2af3a87565d900dad91b273e8 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 14:55:25 +0700
Subject: [PATCH v22 08/22] Streamline calculation of slab blocksize
To reduce duplication. This will likely lead to
division instructions, but a few cycles won't
matter at all when creating the tree.
---
src/include/lib/radixtree.h | 50 ++++++++++++++-----------------------
1 file changed, 19 insertions(+), 31 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 0a39bd6664..172d62c6b0 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -304,6 +304,13 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#define RT_NODE_KIND_256 0x03
#define RT_NODE_KIND_COUNT 4
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
#endif /* RT_COMMON */
@@ -503,59 +510,38 @@ typedef struct RT_SIZE_CLASS_ELEM
/* slab chunk size */
Size inner_size;
Size leaf_size;
-
- /* slab block size */
- Size inner_blocksize;
- Size leaf_blocksize;
} RT_SIZE_CLASS_ELEM;
-/*
- * Calculate the slab blocksize so that we can allocate at least 32 chunks
- * from the block.
- */
-#define NODE_SLAB_BLOCK_SIZE(size) \
- Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
-
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
[RT_CLASS_4_FULL] = {
.name = "radix tree node 4",
.fanout = 4,
.inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_32_FULL] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_125_FULL] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE)),
},
[RT_CLASS_256] = {
.name = "radix tree node 256",
.fanout = 256,
.inner_size = sizeof(RT_NODE_INNER_256),
.leaf_size = sizeof(RT_NODE_LEAF_256),
- .inner_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_INNER_256)),
- .leaf_blocksize = NODE_SLAB_BLOCK_SIZE(sizeof(RT_NODE_LEAF_256)),
},
};
@@ -1361,14 +1347,18 @@ RT_CREATE(MemoryContext ctx)
/* Create the slab allocator for each size class */
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
tree->inner_slabs[i] = SlabContextCreate(ctx,
- RT_SIZE_CLASS_INFO[i].name,
- RT_SIZE_CLASS_INFO[i].inner_blocksize,
- RT_SIZE_CLASS_INFO[i].inner_size);
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
tree->leaf_slabs[i] = SlabContextCreate(ctx,
- RT_SIZE_CLASS_INFO[i].name,
- RT_SIZE_CLASS_INFO[i].leaf_blocksize,
- RT_SIZE_CLASS_INFO[i].leaf_size);
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
}
#endif
@@ -2189,12 +2179,10 @@ RT_DUMP(RT_RADIX_TREE *tree)
{
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize %zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\t%zu\n",
RT_SIZE_CLASS_INFO[i].name,
RT_SIZE_CLASS_INFO[i].inner_size,
- RT_SIZE_CLASS_INFO[i].inner_blocksize,
- RT_SIZE_CLASS_INFO[i].leaf_size,
- RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ RT_SIZE_CLASS_INFO[i].leaf_size);
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
if (!tree->ctl->root)
--
2.39.0
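
To make the consolidated macro concrete, here is a tiny standalone program that evaluates the same Max() expression with two made-up chunk sizes (the real inner/leaf sizes come from the struct definitions above; SLAB_DEFAULT_BLOCK_SIZE is 8 kB as in memutils.h):

#include <stdio.h>

#define Max(a, b)	((a) > (b) ? (a) : (b))
#define SLAB_DEFAULT_BLOCK_SIZE	(8 * 1024)
#define RT_SLAB_BLOCK_SIZE(size) \
	Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)

int
main(void)
{
	/* small chunk: keep roughly the default block, rounded down to a chunk multiple */
	printf("size 40   -> blocksize %d\n", RT_SLAB_BLOCK_SIZE(40));		/* 8160 */

	/* large chunk: grow the block so that at least 32 chunks fit */
	printf("size 2072 -> blocksize %d\n", RT_SLAB_BLOCK_SIZE(2072));	/* 66304 */

	return 0;
}

The expression is integer arithmetic throughout, which is where the division instructions mentioned in the commit message come from.
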
Attachment: v22-0009-Remove-hard-coded-128.patch (text/x-patch)
From e3c3cae8de8db407334aa5f16d187b69baea6279 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 15:51:21 +0700
Subject: [PATCH v22 09/22] Remove hard-coded 128
Also comment that 64 could be a valid number of bits
in the bitmap for this node type.
TODO: Consider whether we should in fact limit this
node to ~64.
In passing, remove "125" from invalid-slot-index macro.
---
src/include/lib/radixtree.h | 19 +++++++++++++------
src/include/lib/radixtree_delete_impl.h | 4 ++--
src/include/lib/radixtree_insert_impl.h | 4 ++--
src/include/lib/radixtree_search_impl.h | 4 ++--
4 files changed, 19 insertions(+), 12 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 172d62c6b0..d15ea8f0fe 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -270,8 +270,15 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
/* Tree level the radix tree uses */
#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
/* Invalid index used in node-125 */
-#define RT_NODE_125_INVALID_IDX 0xFF
+#define RT_INVALID_SLOT_IDX 0xFF
/* Get a chunk from the key */
#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
@@ -409,7 +416,7 @@ typedef struct RT_NODE_BASE_125
uint8 slot_idxs[RT_NODE_MAX_SLOTS];
/* isset is a bitmap to track which slot is in use */
- bitmapword isset[BM_IDX(128)];
+ bitmapword isset[BM_IDX(RT_SLOT_IDX_LIMIT)];
} RT_NODE_BASE_125;
typedef struct RT_NODE_BASE_256
@@ -867,7 +874,7 @@ RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
static inline bool
RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
{
- return node->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX;
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
}
static inline RT_PTR_ALLOC
@@ -881,7 +888,7 @@ static inline RT_VALUE_TYPE
RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
{
Assert(NODE_IS_LEAF(node));
- Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_NODE_125_INVALID_IDX);
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -1037,7 +1044,7 @@ RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner
{
RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
- memset(n125->slot_idxs, RT_NODE_125_INVALID_IDX, sizeof(n125->slot_idxs));
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
}
}
@@ -2052,7 +2059,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < BM_IDX(128); i++)
+ for (int i = 0; i < BM_IDX(RT_SLOT_IDX_LIMIT); i++)
{
fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
}
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index 2612730481..2f1c172672 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -65,13 +65,13 @@
int idx;
int bitnum;
- if (slotpos == RT_NODE_125_INVALID_IDX)
+ if (slotpos == RT_INVALID_SLOT_IDX)
return false;
idx = BM_IDX(slotpos);
bitnum = BM_BIT(slotpos);
n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
- n125->base.slot_idxs[chunk] = RT_NODE_125_INVALID_IDX;
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
break;
}
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index e3e44669ea..90fe5f539e 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -201,7 +201,7 @@
int slotpos = n125->base.slot_idxs[chunk];
int cnt = 0;
- if (slotpos != RT_NODE_125_INVALID_IDX)
+ if (slotpos != RT_INVALID_SLOT_IDX)
{
/* found the existing chunk */
chunk_exists = true;
@@ -247,7 +247,7 @@
bitmapword inverse;
/* get the first word with at least one bit not set */
- for (idx = 0; idx < BM_IDX(128); idx++)
+ for (idx = 0; idx < BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
{
if (n125->base.isset[idx] < ~((bitmapword) 0))
break;
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 365abaa46d..d2bbdd2450 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -73,10 +73,10 @@
int slotpos = n125->base.slot_idxs[chunk];
#ifdef RT_ACTION_UPDATE
- Assert(slotpos != RT_NODE_125_INVALID_IDX);
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
n125->children[slotpos] = new_child;
#else
- if (slotpos == RT_NODE_125_INVALID_IDX)
+ if (slotpos == RT_INVALID_SLOT_IDX)
return false;
#ifdef RT_NODE_LEVEL_LEAF
--
2.39.0
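
For reference, the arithmetic behind the new constant, assuming the usual RT_NODE_MAX_SLOTS of 256 and a 64-bit bitmapword (a sketch of the sizes only, not new code):

/*
 * RT_SLOT_IDX_LIMIT = RT_NODE_MAX_SLOTS / 2 = 256 / 2 = 128
 *
 * so isset[BM_IDX(RT_SLOT_IDX_LIMIT)] stays at 128 / 64 = 2 bitmapwords on
 * 64-bit builds (4 words where bitmapword is 32 bits), the same as the old
 * hard-coded BM_IDX(128).  Limiting the node to ~64 entries, as the TODO
 * above contemplates, would halve the bitmap.
 */
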
Attachment: v22-0010-Reduce-node4-to-node3.patch (text/x-patch)
From dfa2aece9d83cc6e9ab791c6b1641aca1d02d8f6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 20 Jan 2023 18:05:15 +0700
Subject: [PATCH v22 10/22] Reduce node4 to node3
Now that we don't store "chunk", the base node type is only
5 bytes in size. With 3 key chunks, there is no alignment
padding between the chunks array and the child/value array.
This reduces the smallest inner node to 32 bytes on 64-bit
platforms.
---
src/include/lib/radixtree.h | 124 ++++++++++++------------
src/include/lib/radixtree_delete_impl.h | 20 ++--
src/include/lib/radixtree_insert_impl.h | 38 ++++----
src/include/lib/radixtree_iter_impl.h | 18 ++--
src/include/lib/radixtree_search_impl.h | 18 ++--
5 files changed, 109 insertions(+), 109 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d15ea8f0fe..6cc8442c89 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -136,9 +136,9 @@
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
-#define RT_NODE_4_SEARCH_EQ RT_MAKE_NAME(node_4_search_eq)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
-#define RT_NODE_4_GET_INSERTPOS RT_MAKE_NAME(node_4_get_insertpos)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
@@ -181,22 +181,22 @@
#endif
#define RT_NODE RT_MAKE_NAME(node)
#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
-#define RT_NODE_BASE_4 RT_MAKE_NAME(node_base_4)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
-#define RT_NODE_INNER_4 RT_MAKE_NAME(node_inner_4)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
-#define RT_NODE_LEAF_4 RT_MAKE_NAME(node_leaf_4)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
-#define RT_CLASS_4_FULL RT_MAKE_NAME(class_4_full)
+#define RT_CLASS_3_FULL RT_MAKE_NAME(class_3_full)
#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
@@ -305,7 +305,7 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
* allocator padding in both the inner and leaf nodes on DSA.
* node
*/
-#define RT_NODE_KIND_4 0x00
+#define RT_NODE_KIND_3 0x00
#define RT_NODE_KIND_32 0x01
#define RT_NODE_KIND_125 0x02
#define RT_NODE_KIND_256 0x03
@@ -323,7 +323,7 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
typedef enum RT_SIZE_CLASS
{
- RT_CLASS_4_FULL = 0,
+ RT_CLASS_3_FULL = 0,
RT_CLASS_32_PARTIAL,
RT_CLASS_32_FULL,
RT_CLASS_125_FULL,
@@ -387,13 +387,13 @@ typedef struct RT_NODE
/* Base type of each node kinds for leaf and inner nodes */
/* The base types must be a be able to accommodate the largest size
class for variable-sized node kinds*/
-typedef struct RT_NODE_BASE_4
+typedef struct RT_NODE_BASE_3
{
RT_NODE n;
- /* 4 children, for key chunks */
- uint8 chunks[4];
-} RT_NODE_BASE_4;
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
typedef struct RT_NODE_BASE_32
{
@@ -437,21 +437,21 @@ typedef struct RT_NODE_BASE_256
* good. It might be better to just indicate non-existing entries the same way
* in inner nodes.
*/
-typedef struct RT_NODE_INNER_4
+typedef struct RT_NODE_INNER_3
{
- RT_NODE_BASE_4 base;
+ RT_NODE_BASE_3 base;
/* number of children depends on size class */
RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
-} RT_NODE_INNER_4;
+} RT_NODE_INNER_3;
-typedef struct RT_NODE_LEAF_4
+typedef struct RT_NODE_LEAF_3
{
- RT_NODE_BASE_4 base;
+ RT_NODE_BASE_3 base;
/* number of values depends on size class */
RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
-} RT_NODE_LEAF_4;
+} RT_NODE_LEAF_3;
typedef struct RT_NODE_INNER_32
{
@@ -520,11 +520,11 @@ typedef struct RT_SIZE_CLASS_ELEM
} RT_SIZE_CLASS_ELEM;
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
- [RT_CLASS_4_FULL] = {
- .name = "radix tree node 4",
- .fanout = 4,
- .inner_size = sizeof(RT_NODE_INNER_4) + 4 * sizeof(RT_PTR_ALLOC),
- .leaf_size = sizeof(RT_NODE_LEAF_4) + 4 * sizeof(RT_VALUE_TYPE),
+ [RT_CLASS_3_FULL] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
},
[RT_CLASS_32_PARTIAL] = {
.name = "radix tree node 15",
@@ -556,7 +556,7 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
/* Map from the node kind to its minimum size class */
static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
- [RT_NODE_KIND_4] = RT_CLASS_4_FULL,
+ [RT_NODE_KIND_3] = RT_CLASS_3_FULL,
[RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
[RT_NODE_KIND_125] = RT_CLASS_125_FULL,
[RT_NODE_KIND_256] = RT_CLASS_256,
@@ -673,7 +673,7 @@ RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
* if there is no such element.
*/
static inline int
-RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
{
int idx = -1;
@@ -693,7 +693,7 @@ RT_NODE_4_SEARCH_EQ(RT_NODE_BASE_4 *node, uint8 chunk)
* Return index of the chunk to insert into chunks in the given node.
*/
static inline int
-RT_NODE_4_GET_INSERTPOS(RT_NODE_BASE_4 *node, uint8 chunk)
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
{
int idx;
@@ -810,7 +810,7 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
/*
* Functions to manipulate both chunks array and children/values array.
- * These are used for node-4 and node-32.
+ * These are used for node-3 and node-32.
*/
/* Shift the elements right at 'idx' by one */
@@ -848,7 +848,7 @@ static inline void
RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
@@ -860,7 +860,7 @@ static inline void
RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_4_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
@@ -1060,9 +1060,9 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1183,17 +1183,17 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL node;
- RT_NODE_INNER_4 *n4;
+ RT_NODE_INNER_3 *n3;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, true);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, true);
node = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(node, RT_NODE_KIND_4, RT_CLASS_4_FULL, true);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3_FULL, true);
node->shift = shift;
node->count = 1;
- n4 = (RT_NODE_INNER_4 *) node;
- n4->base.chunks[0] = 0;
- n4->children[0] = tree->ctl->root;
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
/* Update the root */
tree->ctl->root = allocnode;
@@ -1223,9 +1223,9 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_4_FULL, inner);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_4, RT_CLASS_4_FULL, inner);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
@@ -1430,12 +1430,12 @@ RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
- for (int i = 0; i < n4->base.n.count; i++)
- RT_FREE_RECURSE(tree, n4->children[i]);
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
break;
}
@@ -1892,12 +1892,12 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE_BASE_4 *n4 = (RT_NODE_BASE_4 *) node;
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
- for (int i = 1; i < n4->n.count; i++)
- Assert(n4->chunks[i - 1] < n4->chunks[i]);
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
break;
}
@@ -1959,10 +1959,10 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
RT_SCOPE void
RT_STATS(RT_RADIX_TREE *tree)
{
- ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n4 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
tree->ctl->num_keys,
tree->ctl->root->shift / RT_NODE_SPAN,
- tree->ctl->cnt[RT_CLASS_4_FULL],
+ tree->ctl->cnt[RT_CLASS_3_FULL],
tree->ctl->cnt[RT_CLASS_32_PARTIAL],
tree->ctl->cnt[RT_CLASS_32_FULL],
tree->ctl->cnt[RT_CLASS_125_FULL],
@@ -1977,7 +1977,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_4) ? 4 :
+ (node->kind == RT_NODE_KIND_3) ? 3 :
(node->kind == RT_NODE_KIND_32) ? 32 :
(node->kind == RT_NODE_KIND_125) ? 125 : 256,
node->fanout == 0 ? 256 : node->fanout,
@@ -1988,26 +1988,26 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
for (int i = 0; i < node->count; i++)
{
if (NODE_IS_LEAF(node))
{
- RT_NODE_LEAF_4 *n4 = (RT_NODE_LEAF_4 *) node;
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
- space, n4->base.chunks[i], (uint64) n4->values[i]);
+ space, n3->base.chunks[i], (uint64) n3->values[i]);
}
else
{
- RT_NODE_INNER_4 *n4 = (RT_NODE_INNER_4 *) node;
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
fprintf(stderr, "%schunk 0x%X ->",
- space, n4->base.chunks[i]);
+ space, n3->base.chunks[i]);
if (recurse)
- RT_DUMP_NODE(n4->children[i], level + 1, recurse);
+ RT_DUMP_NODE(n3->children[i], level + 1, recurse);
else
fprintf(stderr, "\n");
}
@@ -2229,22 +2229,22 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ITER
#undef RT_NODE
#undef RT_NODE_ITER
-#undef RT_NODE_BASE_4
+#undef RT_NODE_BASE_3
#undef RT_NODE_BASE_32
#undef RT_NODE_BASE_125
#undef RT_NODE_BASE_256
-#undef RT_NODE_INNER_4
+#undef RT_NODE_INNER_3
#undef RT_NODE_INNER_32
#undef RT_NODE_INNER_125
#undef RT_NODE_INNER_256
-#undef RT_NODE_LEAF_4
+#undef RT_NODE_LEAF_3
#undef RT_NODE_LEAF_32
#undef RT_NODE_LEAF_125
#undef RT_NODE_LEAF_256
#undef RT_SIZE_CLASS
#undef RT_SIZE_CLASS_ELEM
#undef RT_SIZE_CLASS_INFO
-#undef RT_CLASS_4_FULL
+#undef RT_CLASS_3_FULL
#undef RT_CLASS_32_PARTIAL
#undef RT_CLASS_32_FULL
#undef RT_CLASS_125_FULL
@@ -2282,9 +2282,9 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_REPLACE_NODE
#undef RT_PTR_GET_LOCAL
#undef RT_PTR_ALLOC_IS_VALID
-#undef RT_NODE_4_SEARCH_EQ
+#undef RT_NODE_3_SEARCH_EQ
#undef RT_NODE_32_SEARCH_EQ
-#undef RT_NODE_4_GET_INSERTPOS
+#undef RT_NODE_3_GET_INSERTPOS
#undef RT_NODE_32_GET_INSERTPOS
#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
#undef RT_CHUNK_VALUES_ARRAY_SHIFT
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index 2f1c172672..b9f07f4eb5 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -1,12 +1,12 @@
/* TODO: shrink nodes */
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -24,20 +24,20 @@
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
- int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_DELETE(n4->base.chunks, n4->values,
- n4->base.n.count, idx);
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
#else
- RT_CHUNK_CHILDREN_ARRAY_DELETE(n4->base.chunks, n4->children,
- n4->base.n.count, idx);
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
#endif
break;
}
@@ -100,7 +100,7 @@
return true;
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 90fe5f539e..16461bdb03 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -1,10 +1,10 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -25,25 +25,25 @@
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
int idx;
- idx = RT_NODE_4_SEARCH_EQ(&n4->base, chunk);
+ idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
if (idx != -1)
{
/* found the existing chunk */
chunk_exists = true;
#ifdef RT_NODE_LEVEL_LEAF
- n4->values[idx] = value;
+ n3->values[idx] = value;
#else
- n4->children[idx] = child;
+ n3->children[idx] = child;
#endif
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n4)))
+ if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n3)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
@@ -51,16 +51,16 @@
const uint8 new_kind = RT_NODE_KIND_32;
const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
- /* grow node from 4 to 32 */
+ /* grow node from 3 to 32 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
new32 = (RT_NODE32_TYPE *) newnode;
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_COPY(n4->base.chunks, n4->values,
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
new32->base.chunks, new32->values);
#else
- RT_CHUNK_CHILDREN_ARRAY_COPY(n4->base.chunks, n4->children,
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
new32->base.chunks, new32->children);
#endif
RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
@@ -68,27 +68,27 @@
}
else
{
- int insertpos = RT_NODE_4_GET_INSERTPOS(&n4->base, chunk);
- int count = n4->base.n.count;
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
/* shift chunks and children */
if (insertpos < count)
{
Assert(count > 0);
#ifdef RT_NODE_LEVEL_LEAF
- RT_CHUNK_VALUES_ARRAY_SHIFT(n4->base.chunks, n4->values,
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
count, insertpos);
#else
- RT_CHUNK_CHILDREN_ARRAY_SHIFT(n4->base.chunks, n4->children,
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
count, insertpos);
#endif
}
- n4->base.chunks[insertpos] = chunk;
+ n3->base.chunks[insertpos] = chunk;
#ifdef RT_NODE_LEVEL_LEAF
- n4->values[insertpos] = value;
+ n3->values[insertpos] = value;
#else
- n4->children[insertpos] = child;
+ n3->children[insertpos] = child;
#endif
break;
}
@@ -304,7 +304,7 @@
return chunk_exists;
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 5c06f8b414..c428531438 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -1,10 +1,10 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -31,19 +31,19 @@
switch (node_iter->node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node_iter->node;
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
node_iter->current_idx++;
- if (node_iter->current_idx >= n4->base.n.count)
+ if (node_iter->current_idx >= n3->base.n.count)
break;
#ifdef RT_NODE_LEVEL_LEAF
- value = n4->values[node_iter->current_idx];
+ value = n3->values[node_iter->current_idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n4->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
#endif
- key_chunk = n4->base.chunks[node_iter->current_idx];
+ key_chunk = n3->base.chunks[node_iter->current_idx];
found = true;
break;
}
@@ -132,7 +132,7 @@
return child;
#endif
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index d2bbdd2450..31138b6a72 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -1,10 +1,10 @@
#if defined(RT_NODE_LEVEL_INNER)
-#define RT_NODE4_TYPE RT_NODE_INNER_4
+#define RT_NODE3_TYPE RT_NODE_INNER_3
#define RT_NODE32_TYPE RT_NODE_INNER_32
#define RT_NODE125_TYPE RT_NODE_INNER_125
#define RT_NODE256_TYPE RT_NODE_INNER_256
#elif defined(RT_NODE_LEVEL_LEAF)
-#define RT_NODE4_TYPE RT_NODE_LEAF_4
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
#define RT_NODE32_TYPE RT_NODE_LEAF_32
#define RT_NODE125_TYPE RT_NODE_LEAF_125
#define RT_NODE256_TYPE RT_NODE_LEAF_256
@@ -27,22 +27,22 @@
switch (node->kind)
{
- case RT_NODE_KIND_4:
+ case RT_NODE_KIND_3:
{
- RT_NODE4_TYPE *n4 = (RT_NODE4_TYPE *) node;
- int idx = RT_NODE_4_SEARCH_EQ((RT_NODE_BASE_4 *) n4, chunk);
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
#ifdef RT_ACTION_UPDATE
Assert(idx >= 0);
- n4->children[idx] = new_child;
+ n3->children[idx] = new_child;
#else
if (idx < 0)
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = n4->values[idx];
+ value = n3->values[idx];
#else
- child = n4->children[idx];
+ child = n3->children[idx];
#endif
#endif /* RT_ACTION_UPDATE */
break;
@@ -125,7 +125,7 @@
return true;
#endif /* RT_ACTION_UPDATE */
-#undef RT_NODE4_TYPE
+#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
#undef RT_NODE125_TYPE
#undef RT_NODE256_TYPE
--
2.39.0
Attachment: v22-0011-Expand-commentary-for-kinds-vs.-size-classes.patch (text/x-patch)
From 78faaad01a69a5a81eb219e3f45983c1b466e173 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 21 Jan 2023 12:52:53 +0700
Subject: [PATCH v22 11/22] Expand commentary for kinds vs. size classes
Also move class enum closer to array and add #undef's
---
src/include/lib/radixtree.h | 76 ++++++++++++++++++++++++++-----------
1 file changed, 53 insertions(+), 23 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 6cc8442c89..4a2dad82bf 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -288,22 +288,26 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
/*
- * Supported radix tree node kinds and size classes.
+ * Node kinds
*
- * There are 4 node kinds and each node kind have one or two size classes,
- * partial and full. The size classes in the same node kind have the same
- * node structure but have the different number of fanout that is stored
- * in 'fanout' of RT_NODE. For example in size class 15, when a 16th element
- * is to be inserted, we allocate a larger area and memcpy the entire old
- * node to it.
+ * The different node kinds are what make the tree "adaptive".
*
- * This technique allows us to limit the node kinds to 4, which limits the
- * number of cases in switch statements. It also allows a possible future
- * optimization to encode the node kind in a pointer tag.
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256 is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
*
- * These size classes have been chose carefully so that it minimizes the
- * allocator padding in both the inner and leaf nodes on DSA.
- * node
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
*/
#define RT_NODE_KIND_3 0x00
#define RT_NODE_KIND_32 0x01
@@ -320,16 +324,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#endif /* RT_COMMON */
-
-typedef enum RT_SIZE_CLASS
-{
- RT_CLASS_3_FULL = 0,
- RT_CLASS_32_PARTIAL,
- RT_CLASS_32_FULL,
- RT_CLASS_125_FULL,
- RT_CLASS_256
-} RT_SIZE_CLASS;
-
/* Common type for all nodes types */
typedef struct RT_NODE
{
@@ -508,6 +502,37 @@ typedef struct RT_NODE_LEAF_256
RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
} RT_NODE_LEAF_256;
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3_FULL = 0,
+ RT_CLASS_32_PARTIAL,
+ RT_CLASS_32_FULL,
+ RT_CLASS_125_FULL,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
/* Information for each size class */
typedef struct RT_SIZE_CLASS_ELEM
{
@@ -2217,6 +2242,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef NODE_IS_EMPTY
#undef VAR_NODE_HAS_FREE_SLOT
#undef FIXED_NODE_HAS_FREE_SLOT
+#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
#undef RT_RADIX_TREE_MAGIC
@@ -2229,6 +2255,10 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ITER
#undef RT_NODE
#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
#undef RT_NODE_BASE_3
#undef RT_NODE_BASE_32
#undef RT_NODE_BASE_125
--
2.39.0
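As a supplement to the commentary added in 0011, here is a minimal standalone C sketch of the "size classes within one node kind" idea: growing to a different kind needs dedicated conversion code, but growing within a kind is just allocate + memcpy of the smaller allocation. All names, the 15/32 fanouts, and the layout below are illustrative assumptions, not the tree's actual symbols; it compiles on its own with a C99 compiler.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* illustrative "node32" kind with two size classes: fanout 15 and fanout 32 */
typedef struct Node32
{
    uint8_t  count;         /* slots in use */
    uint8_t  fanout;        /* allocated capacity: 15 or 32 */
    uint8_t  chunks[32];    /* sorted key bytes; only the first 'fanout' usable */
    uint64_t values[32];    /* only the first 'fanout' slots are allocated */
} Node32;

static size_t
node32_size(uint8_t fanout)
{
    /* allocate only as many value slots as this size class needs */
    return offsetof(Node32, values) + fanout * sizeof(uint64_t);
}

static Node32 *
node32_alloc(uint8_t fanout)
{
    Node32 *n = calloc(1, node32_size(fanout));

    n->fanout = fanout;
    return n;
}

static Node32 *
node32_grow(Node32 *small)
{
    /* same kind, next size class: allocate + memcpy, no per-field logic */
    Node32 *big = node32_alloc(32);

    memcpy(big, small, node32_size(small->fanout));
    big->fanout = 32;
    free(small);
    return big;
}

int
main(void)
{
    Node32 *n = node32_alloc(15);

    for (int i = 0; i < 15; i++)
    {
        n->chunks[i] = (uint8_t) i;
        n->values[i] = (uint64_t) i * 10;
        n->count++;
    }

    /* a 16th entry does not fit in the small class, so grow first */
    n = node32_grow(n);
    n->chunks[n->count] = 15;
    n->values[n->count] = 150;
    n->count++;

    printf("fanout=%u count=%u last=%llu\n",
           n->fanout, n->count, (unsigned long long) n->values[15]);
    free(n);
    return 0;
}

The point mirrored from the patch is that node32_grow() needs no per-field conversion logic, which is what makes adding further size classes within a kind cheap.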
Attachment: v22-0012-Tool-for-measuring-radix-tree-performance.patch (text/x-patch)
From 626a2545ffaaf6e1ee09a502df152fa0597276fa Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v22 12/22] Tool for measuring radix tree performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 ++
contrib/bench_radix_tree/bench_radix_tree.c | 656 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 822 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..4c785c7336
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,656 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.0
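For readers of the benchmark module, here is a worked, self-contained sketch of the TID-to-key encoding that tid_to_key_off() performs: the offset number occupies the low bits, the block number the high bits, and the combined integer is split into a bit position within a 64-bit value (the low 6 bits) and a radix tree key (the rest). OFFSET_BITS below is an assumption standing in for pg_ceil_log2_32(MaxHeapTuplesPerPage), which is 9 for 8kB heap pages; nothing here is PostgreSQL API.

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 9    /* stand-in for pg_ceil_log2_32(MaxHeapTuplesPerPage) */

static uint64_t
tid_to_key(uint32_t block, uint16_t offnum, uint32_t *bit)
{
    uint64_t tid_i = offnum | ((uint64_t) block << OFFSET_BITS);

    *bit = (uint32_t) (tid_i & 63);   /* bit position within the 64-bit value */
    return tid_i >> 6;                /* radix tree key */
}

int
main(void)
{
    uint32_t bit;
    uint64_t key = tid_to_key(1000, 5, &bit);

    printf("key=%llu bit=%u\n", (unsigned long long) key, bit);   /* key=8000 bit=5 */
    return 0;
}

So (block 1000, offset 5) lands in key 8000 with bit 5 set, and up to 64 consecutive encoded positions share one key/value pair, which is what keeps the per-TID memory footprint low.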
Attachment: v22-0013-Get-rid-of-NODE_IS_EMPTY-macro.patch (text/x-patch)
From d9944828bfc3ab39f29b522aadedda6e5d978041 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 21 Jan 2023 13:40:28 +0700
Subject: [PATCH v22 13/22] Get rid of NODE_IS_EMPTY macro
It's already pretty clear what "count == 0" means, and the
existing comments make it obvious.
---
src/include/lib/radixtree.h | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 4a2dad82bf..567eab4bc8 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -372,7 +372,6 @@ typedef struct RT_NODE
#endif
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
-#define NODE_IS_EMPTY(n) (((RT_PTR_LOCAL) (n))->count == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
@@ -1701,7 +1700,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
* Return if the leaf node still has keys and we don't need to delete the
* node.
*/
- if (!NODE_IS_EMPTY(node))
+ if (node->count > 0)
return true;
/* Free the empty leaf node */
@@ -1717,7 +1716,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
Assert(deleted);
/* If the node didn't become empty, we stop deleting the key */
- if (!NODE_IS_EMPTY(node))
+ if (node->count > 0)
break;
/* The node became empty */
@@ -2239,7 +2238,6 @@ RT_DUMP(RT_RADIX_TREE *tree)
/* locally declared macros */
#undef NODE_IS_LEAF
-#undef NODE_IS_EMPTY
#undef VAR_NODE_HAS_FREE_SLOT
#undef FIXED_NODE_HAS_FREE_SLOT
#undef RT_NODE_KIND_COUNT
--
2.39.0
Attachment: v22-0014-Add-some-comments-for-insert-logic.patch (text/x-patch)
From dec37d66a36728ea9581ac51b91ab91850ec0e3b Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 21 Jan 2023 14:21:55 +0700
Subject: [PATCH v22 14/22] Add some comments for insert logic
---
src/include/lib/radixtree.h | 29 ++++++++++++++++++++++---
src/include/lib/radixtree_insert_impl.h | 5 +++++
2 files changed, 31 insertions(+), 3 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 567eab4bc8..d48c915373 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -731,8 +731,8 @@ RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
}
/*
- * Return index of the first element in 'base' that equals 'key'. Return -1
- * if there is no such element.
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
*/
static inline int
RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
@@ -762,14 +762,22 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
#endif
#ifndef USE_NO_SIMD
+ /* replicate the search key */
spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to the 32 keys stored in the node */
vector8_load(&haystack1, &node->chunks[0]);
vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
cmp1 = vector8_eq(spread_chunk, haystack1);
cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
bitfield &= ((UINT64CONST(1) << count) - 1);
+ /* convert bitfield to index by counting trailing zeros */
if (bitfield)
index_simd = pg_rightmost_one_pos32(bitfield);
@@ -781,7 +789,8 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
}
/*
- * Return index of the chunk to insert into chunks in the given node.
+ * Return index of the node's chunk array to insert into,
+ * such that the chunk array remains ordered.
*/
static inline int
RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
@@ -804,12 +813,26 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
for (index = 0; index < count; index++)
{
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
break;
+ }
}
#endif
#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * <=. There'll never be any equal elements in the current uses, but that's
+ * what we get here...
+ */
spread_chunk = vector8_broadcast(chunk);
vector8_load(&haystack1, &node->chunks[0]);
vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 16461bdb03..8470c8fc70 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -162,6 +162,11 @@
#endif
}
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
--
2.39.0
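To make the comments added in 0014 concrete, below is a standalone sketch of the broadcast / compare / movemask / count-trailing-zeros search that RT_NODE_32_SEARCH_EQ performs through the vector8 wrappers. This version uses raw SSE2 intrinsics and a single 16-byte vector (the node32 code compares 32 bytes with two loads), and __builtin_ctz assumes gcc or clang; none of this is PostgreSQL's simd.h API, just an illustration of the technique.

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stdio.h>

static int
search_eq_16(const uint8_t *chunks, int count, uint8_t key)
{
    __m128i  spread = _mm_set1_epi8((char) key);                /* replicate the key byte */
    __m128i  haystack = _mm_loadu_si128((const __m128i *) chunks);
    __m128i  cmp = _mm_cmpeq_epi8(spread, haystack);            /* 0xFF in matching lanes */
    uint32_t bitfield = (uint32_t) _mm_movemask_epi8(cmp);      /* one bit per lane */

    bitfield &= (1u << count) - 1;                              /* ignore slots past 'count' */
    return bitfield ? __builtin_ctz(bitfield) : -1;             /* index of first match */
}

int
main(void)
{
    uint8_t chunks[16] = {2, 5, 9, 17, 42};   /* 5 slots in use, rest zero */

    printf("%d %d\n",
           search_eq_16(chunks, 5, 42),       /* 4 */
           search_eq_16(chunks, 5, 7));       /* -1 */
    return 0;
}

On x86-64 this builds with plain gcc since SSE2 is baseline; masking with (1 << count) - 1 is what keeps stale bytes past 'count' from producing false matches, mirroring the masking step described in the patch's comments.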
Attachment: v22-0015-Get-rid-of-FIXED_NODE_HAS_FREE_SLOT.patch (text/x-patch)
From 23527a3d2b725a4f3876125e5f663540ab411e92 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 11:53:33 +0700
Subject: [PATCH v22 15/22] Get rid of FIXED_NODE_HAS_FREE_SLOT
It's only used in one assert for the node256 kind, whose
fanout is necessarily fixed, and we already have a
convenient macro to compare that with.
---
src/include/lib/radixtree.h | 3 ---
src/include/lib/radixtree_insert_impl.h | 2 +-
2 files changed, 1 insertion(+), 4 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d48c915373..8fbc0b5086 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -374,8 +374,6 @@ typedef struct RT_NODE
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
#define VAR_NODE_HAS_FREE_SLOT(node) \
((node)->base.n.count < (node)->base.n.fanout)
-#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
- ((node)->base.n.count < RT_SIZE_CLASS_INFO[class].fanout)
/* Base type of each node kinds for leaf and inner nodes */
/* The base types must be a be able to accommodate the largest size
@@ -2262,7 +2260,6 @@ RT_DUMP(RT_RADIX_TREE *tree)
/* locally declared macros */
#undef NODE_IS_LEAF
#undef VAR_NODE_HAS_FREE_SLOT
-#undef FIXED_NODE_HAS_FREE_SLOT
#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
#undef RT_RADIX_TREE_MAGIC
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 8470c8fc70..b484b7a099 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -286,7 +286,7 @@
#else
chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
#endif
- Assert(chunk_exists || FIXED_NODE_HAS_FREE_SLOT(n256, RT_CLASS_256));
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
#ifdef RT_NODE_LEVEL_LEAF
RT_NODE_LEAF_256_SET(n256, chunk, value);
--
2.39.0
Attachment: v22-0016-s-VAR_NODE_HAS_FREE_SLOT-RT_NODE_MUST_GROW.patch (text/x-patch)
From 48033e8a97ff0d8f6276578c0ffd86209a2e129b Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 12:11:11 +0700
Subject: [PATCH v22 16/22] s/VAR_NODE_HAS_FREE_SLOT/RT_NODE_MUST_GROW/
---
src/include/lib/radixtree.h | 6 +++---
src/include/lib/radixtree_insert_impl.h | 8 ++++----
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 8fbc0b5086..cd8b8d1c22 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -372,8 +372,8 @@ typedef struct RT_NODE
#endif
#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
-#define VAR_NODE_HAS_FREE_SLOT(node) \
- ((node)->base.n.count < (node)->base.n.fanout)
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
/* Base type of each node kinds for leaf and inner nodes */
/* The base types must be a be able to accommodate the largest size
@@ -2259,7 +2259,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
/* locally declared macros */
#undef NODE_IS_LEAF
-#undef VAR_NODE_HAS_FREE_SLOT
+#undef RT_NODE_MUST_GROW
#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
#undef RT_RADIX_TREE_MAGIC
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index b484b7a099..a0f46b37d3 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -43,7 +43,7 @@
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n3)))
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
@@ -114,7 +114,7 @@
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)) &&
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
n32->base.n.fanout == class32_min.fanout)
{
RT_PTR_ALLOC allocnode;
@@ -137,7 +137,7 @@
node = newnode;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n32)))
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
@@ -218,7 +218,7 @@
break;
}
- if (unlikely(!VAR_NODE_HAS_FREE_SLOT(n125)))
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
--
2.39.0
Attachment: v22-0018-Clean-up-symbols.patch (text/x-patch)
From 67984ba863923017a7c9f976be58fef706eeccd2 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 14:37:53 +0700
Subject: [PATCH v22 18/22] Clean up symbols
Remove remaining stragglers that weren't named "RT_*"
and get rid of the temporary expedient RT_COMMON
block in favor of explicit #undefs everywhere.
---
src/include/lib/radixtree.h | 91 ++++++++++++++-----------
src/include/lib/radixtree_delete_impl.h | 4 +-
src/include/lib/radixtree_insert_impl.h | 4 +-
src/include/lib/radixtree_iter_impl.h | 4 +-
src/include/lib/radixtree_search_impl.h | 4 +-
5 files changed, 58 insertions(+), 49 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 7c3f3dcf4f..95124696ef 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -246,14 +246,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
/* generate implementation of the radix tree */
#ifdef RT_DEFINE
-/* macros and types common to all implementations */
-#ifndef RT_COMMON
-#define RT_COMMON
-
-#ifdef RT_DEBUG
-#define UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
-#endif
-
/* The number of bits encoded in one tree level */
#define RT_NODE_SPAN BITS_PER_BYTE
@@ -321,8 +313,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
#define RT_SLAB_BLOCK_SIZE(size) \
Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
-#endif /* RT_COMMON */
-
/* Common type for all nodes types */
typedef struct RT_NODE
{
@@ -370,7 +360,7 @@ typedef struct RT_NODE
#define RT_INVALID_PTR_ALLOC NULL
#endif
-#define NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
#define RT_NODE_MUST_GROW(node) \
((node)->base.n.count == (node)->base.n.fanout)
@@ -916,14 +906,14 @@ RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
static inline RT_PTR_ALLOC
RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[node->base.slot_idxs[chunk]];
}
static inline RT_VALUE_TYPE
RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
return node->values[node->base.slot_idxs[chunk]];
}
@@ -934,7 +924,7 @@ RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
static inline bool
RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
return node->children[chunk] != RT_INVALID_PTR_ALLOC;
}
@@ -944,14 +934,14 @@ RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
}
static inline RT_PTR_ALLOC
RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
return node->children[chunk];
}
@@ -959,7 +949,7 @@ RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
static inline RT_VALUE_TYPE
RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
{
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
return node->values[chunk];
}
@@ -968,7 +958,7 @@ RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
static inline void
RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = child;
}
@@ -979,7 +969,7 @@ RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[idx] |= ((bitmapword) 1 << bitnum);
node->values[chunk] = value;
}
@@ -988,7 +978,7 @@ RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
static inline void
RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
{
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
node->children[chunk] = RT_INVALID_PTR_ALLOC;
}
@@ -998,7 +988,7 @@ RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
int idx = BM_IDX(chunk);
int bitnum = BM_BIT(chunk);
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
node->isset[idx] &= ~((bitmapword) 1 << bitnum);
}
@@ -1458,7 +1448,7 @@ RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
CHECK_FOR_INTERRUPTS();
/* The leaf node doesn't have child pointers */
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
dsa_free(tree->dsa, ptr);
return;
@@ -1587,7 +1577,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
child = RT_PTR_GET_LOCAL(tree, stored_child);
- if (NODE_IS_LEAF(child))
+ if (RT_NODE_IS_LEAF(child))
break;
if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
@@ -1637,7 +1627,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
{
RT_PTR_ALLOC child;
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
break;
if (!RT_NODE_SEARCH_INNER(node, key, &child))
@@ -1788,7 +1778,7 @@ RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
node_iter->current_idx = -1;
/* We don't advance the leaf node iterator here */
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
return;
/* Advance to the next slot in the inner node */
@@ -1972,7 +1962,7 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
}
case RT_NODE_KIND_256:
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
int cnt = 0;
@@ -1992,6 +1982,9 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
/***************** DEBUG FUNCTIONS *****************/
#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
RT_SCOPE void
RT_STATS(RT_RADIX_TREE *tree)
{
@@ -2012,7 +2005,7 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
char space[125] = {0};
fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
- NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
(node->kind == RT_NODE_KIND_3) ? 3 :
(node->kind == RT_NODE_KIND_32) ? 32 :
(node->kind == RT_NODE_KIND_125) ? 125 : 256,
@@ -2028,11 +2021,11 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
for (int i = 0; i < node->count; i++)
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, n3->base.chunks[i], (uint64) n3->values[i]);
}
else
@@ -2054,11 +2047,11 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
for (int i = 0; i < node->count; i++)
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, n32->base.chunks[i], (uint64) n32->values[i]);
}
else
@@ -2090,14 +2083,14 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
}
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
fprintf(stderr, ", isset-bitmap:");
for (int i = 0; i < BM_IDX(RT_SLOT_IDX_LIMIT); i++)
{
- fprintf(stderr, UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ fprintf(stderr, RT_UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
}
fprintf(stderr, "\n");
}
@@ -2107,11 +2100,11 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
continue;
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, i, (uint64) RT_NODE_LEAF_125_GET_VALUE(n125, i));
}
else
@@ -2134,14 +2127,14 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
continue;
- fprintf(stderr, "%schunk 0x%X value 0x" UINT64_FORMAT_HEX "\n",
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
space, i, (uint64) RT_NODE_LEAF_256_GET_VALUE(n256, i));
}
else
@@ -2174,7 +2167,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
int level = 0;
elog(NOTICE, "-----------------------------------------------------------");
- elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ")",
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ")",
tree->ctl->max_val, tree->ctl->max_val);
if (!tree->ctl->root)
@@ -2185,7 +2178,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
if (key > tree->ctl->max_val)
{
- elog(NOTICE, "key " UINT64_FORMAT "(0x" UINT64_FORMAT_HEX ") is larger than max val",
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val",
key, key);
return;
}
@@ -2198,7 +2191,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
RT_DUMP_NODE(node, level, false);
- if (NODE_IS_LEAF(node))
+ if (RT_NODE_IS_LEAF(node))
{
uint64 dummy;
@@ -2249,15 +2242,30 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_VALUE_TYPE
/* locally declared macros */
-#undef NODE_IS_LEAF
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef BM_IDX
+#undef BM_BIT
+#undef RT_NODE_IS_LEAF
#undef RT_NODE_MUST_GROW
#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
/* type declarations */
#undef RT_RADIX_TREE
#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
#undef RT_PTR_ALLOC
#undef RT_INVALID_PTR_ALLOC
#undef RT_HANDLE
@@ -2295,6 +2303,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ATTACH
#undef RT_DETACH
#undef RT_GET_HANDLE
+#undef RT_SEARCH
#undef RT_SET
#undef RT_BEGIN_ITERATE
#undef RT_ITERATE_NEXT
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
index b9f07f4eb5..99c90771b9 100644
--- a/src/include/lib/radixtree_delete_impl.h
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -17,9 +17,9 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
#else
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
#endif
switch (node->kind)
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index e3c3f7a69d..0fcebf1c6b 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -17,10 +17,10 @@
#ifdef RT_NODE_LEVEL_LEAF
const bool inner = false;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
#else
const bool inner = true;
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
#endif
switch (node->kind)
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index c428531438..823d7107c4 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -18,11 +18,11 @@
#ifdef RT_NODE_LEVEL_LEAF
RT_VALUE_TYPE value;
- Assert(NODE_IS_LEAF(node_iter->node));
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
#else
RT_PTR_LOCAL child = NULL;
- Assert(!NODE_IS_LEAF(node_iter->node));
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
#endif
#ifdef RT_SHMEM
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index 31138b6a72..c4352045c8 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -17,12 +17,12 @@
#ifdef RT_NODE_LEVEL_LEAF
RT_VALUE_TYPE value = 0;
- Assert(NODE_IS_LEAF(node));
+ Assert(RT_NODE_IS_LEAF(node));
#else
#ifndef RT_ACTION_UPDATE
RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
#endif
- Assert(!NODE_IS_LEAF(node));
+ Assert(!RT_NODE_IS_LEAF(node));
#endif
switch (node->kind)
--
2.39.0
Attachment: v22-0017-Remove-some-maintenance-hazards-in-growing-nodes.patch (text/x-patch)
From 57a34d75a143086ba8bb3920486747957b87552d Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 13:29:18 +0700
Subject: [PATCH v22 17/22] Remove some maintenance hazards in growing nodes
Arrange so that kinds with only one size class have no
"full" suffix. This ensures that splitting such a class
into multiple classes will force compilation errors if
the dev has not thought through which new class should
apply in each case.
For node32, make growing into a new size class a bit
more general. It's not clear we would ever need more
than 2 classes, but let's not put up additional road
blocks. Change partial/full to min/max. It's a bit
shorter this way, matches some newer coding, and allows
for the possibility of a "mid" class.
Also remove RT_KIND_MIN_SIZE_CLASS, since it doesn't
reduce the need for future changes, only makes such
a change further away from the effect.
In passing, move a declaration to the block where it's used.
---
src/include/lib/radixtree.h | 66 +++++++++++--------------
src/include/lib/radixtree_insert_impl.h | 16 +++---
2 files changed, 37 insertions(+), 45 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index cd8b8d1c22..7c3f3dcf4f 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -196,12 +196,11 @@
#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
-#define RT_CLASS_3_FULL RT_MAKE_NAME(class_3_full)
-#define RT_CLASS_32_PARTIAL RT_MAKE_NAME(class_32_partial)
-#define RT_CLASS_32_FULL RT_MAKE_NAME(class_32_full)
-#define RT_CLASS_125_FULL RT_MAKE_NAME(class_125_full)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
#define RT_CLASS_256 RT_MAKE_NAME(class_256)
-#define RT_KIND_MIN_SIZE_CLASS RT_MAKE_NAME(kind_min_size_class)
/* generate forward declarations necessary to use the radix tree */
#ifdef RT_DECLARE
@@ -523,10 +522,10 @@ typedef struct RT_NODE_LEAF_256
*/
typedef enum RT_SIZE_CLASS
{
- RT_CLASS_3_FULL = 0,
- RT_CLASS_32_PARTIAL,
- RT_CLASS_32_FULL,
- RT_CLASS_125_FULL,
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
RT_CLASS_256
} RT_SIZE_CLASS;
@@ -542,25 +541,25 @@ typedef struct RT_SIZE_CLASS_ELEM
} RT_SIZE_CLASS_ELEM;
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
- [RT_CLASS_3_FULL] = {
+ [RT_CLASS_3] = {
.name = "radix tree node 3",
.fanout = 3,
.inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_32_PARTIAL] = {
+ [RT_CLASS_32_MIN] = {
.name = "radix tree node 15",
.fanout = 15,
.inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_32_FULL] = {
+ [RT_CLASS_32_MAX] = {
.name = "radix tree node 32",
.fanout = 32,
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_125_FULL] = {
+ [RT_CLASS_125] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
@@ -576,14 +575,6 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
-/* Map from the node kind to its minimum size class */
-static const RT_SIZE_CLASS RT_KIND_MIN_SIZE_CLASS[RT_NODE_KIND_COUNT] = {
- [RT_NODE_KIND_3] = RT_CLASS_3_FULL,
- [RT_NODE_KIND_32] = RT_CLASS_32_PARTIAL,
- [RT_NODE_KIND_125] = RT_CLASS_125_FULL,
- [RT_NODE_KIND_256] = RT_CLASS_256,
-};
-
#ifdef RT_SHMEM
/* A magic value used to identify our radix tree */
#define RT_RADIX_TREE_MAGIC 0x54A48167
@@ -893,7 +884,7 @@ static inline void
RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
@@ -905,7 +896,7 @@ static inline void
RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_FULL].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
@@ -1105,9 +1096,9 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, inner);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1230,9 +1221,9 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_LOCAL node;
RT_NODE_INNER_3 *n3;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, true);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, true);
node = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3_FULL, true);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, true);
node->shift = shift;
node->count = 1;
@@ -1268,9 +1259,9 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL
int newshift = shift - RT_NODE_SPAN;
bool inner = newshift > 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3_FULL, inner);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3_FULL, inner);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, inner);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
@@ -2007,10 +1998,10 @@ RT_STATS(RT_RADIX_TREE *tree)
ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
tree->ctl->num_keys,
tree->ctl->root->shift / RT_NODE_SPAN,
- tree->ctl->cnt[RT_CLASS_3_FULL],
- tree->ctl->cnt[RT_CLASS_32_PARTIAL],
- tree->ctl->cnt[RT_CLASS_32_FULL],
- tree->ctl->cnt[RT_CLASS_125_FULL],
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
tree->ctl->cnt[RT_CLASS_256])));
}
@@ -2292,12 +2283,11 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_SIZE_CLASS
#undef RT_SIZE_CLASS_ELEM
#undef RT_SIZE_CLASS_INFO
-#undef RT_CLASS_3_FULL
-#undef RT_CLASS_32_PARTIAL
-#undef RT_CLASS_32_FULL
-#undef RT_CLASS_125_FULL
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
#undef RT_CLASS_256
-#undef RT_KIND_MIN_SIZE_CLASS
/* function declarations */
#undef RT_CREATE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index a0f46b37d3..e3c3f7a69d 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -49,7 +49,7 @@
RT_PTR_LOCAL newnode;
RT_NODE32_TYPE *new32;
const uint8 new_kind = RT_NODE_KIND_32;
- const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
/* grow node from 3 to 32 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
@@ -96,8 +96,7 @@
/* FALLTHROUGH */
case RT_NODE_KIND_32:
{
- const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_PARTIAL];
- const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_FULL];
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
int idx;
@@ -115,11 +114,14 @@
}
if (unlikely(RT_NODE_MUST_GROW(n32)) &&
- n32->base.n.fanout == class32_min.fanout)
+ n32->base.n.fanout < class32_max.fanout)
{
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- const RT_SIZE_CLASS new_class = RT_CLASS_32_FULL;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
/* grow to the next size class of this kind */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
@@ -143,7 +145,7 @@
RT_PTR_LOCAL newnode;
RT_NODE125_TYPE *new125;
const uint8 new_kind = RT_NODE_KIND_125;
- const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
Assert(n32->base.n.fanout == class32_max.fanout);
@@ -224,7 +226,7 @@
RT_PTR_LOCAL newnode;
RT_NODE256_TYPE *new256;
const uint8 new_kind = RT_NODE_KIND_256;
- const RT_SIZE_CLASS new_class = RT_KIND_MIN_SIZE_CLASS[new_kind];
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
/* grow node from 125 to 256 */
allocnode = RT_ALLOC_NODE(tree, new_class, inner);
--
2.39.0
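As an aside for readers following the renaming in the patch above: the size classes and their fanouts (3, 15, 32, 125, 256) come straight from RT_SIZE_CLASS_INFO, with two classes sharing the node-32 kind. The standalone sketch below only illustrates that growth order; the enum, the helper, and the driver program are made-up stand-ins, not code from the patch.

#include <stdio.h>

/*
 * Illustrative size classes only; the fanouts mirror RT_SIZE_CLASS_INFO
 * in the patch, everything else here is simplified.
 */
typedef enum
{
    CLASS_3,        /* node kind 3 */
    CLASS_32_MIN,   /* node kind 32, partial allocation (fanout 15) */
    CLASS_32_MAX,   /* node kind 32, full allocation (fanout 32) */
    CLASS_125,      /* node kind 125 */
    CLASS_256       /* node kind 256 */
} size_class;

static const int fanout[] = {3, 15, 32, 125, 256};

/* Smallest size class whose fanout can hold "count" children (count <= 256) */
static size_class
class_for_count(int count)
{
    size_class  sc = CLASS_3;

    while (count > fanout[sc])
        sc++;
    return sc;
}

int
main(void)
{
    /*
     * Growing past CLASS_32_MIN lands in CLASS_32_MAX without changing the
     * node kind; only the 3->32, 32->125 and 125->256 steps switch kinds.
     */
    printf("16 children -> class %d (fanout %d)\n",
           (int) class_for_count(16), fanout[class_for_count(16)]);
    printf("33 children -> class %d (fanout %d)\n",
           (int) class_for_count(33), fanout[class_for_count(33)]);
    return 0;
}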
Attachment: v22-0019-Standardize-on-testing-for-is-leaf.patch (text/x-patch; charset=US-ASCII)
From 9908dfdecbd22eacbc57a7863fe67cbb42b22f90 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 22 Jan 2023 15:10:10 +0700
Subject: [PATCH v22 19/22] Standardize on testing for "is leaf"
Some recent code decided to test for "is inner", so make
everything consistent.
---
src/include/lib/radixtree.h | 38 ++++++++++++-------------
src/include/lib/radixtree_insert_impl.h | 18 ++++++------
2 files changed, 28 insertions(+), 28 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 95124696ef..5927437034 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1019,24 +1019,24 @@ RT_SHIFT_GET_MAX_VAL(int shift)
* Allocate a new node with the given node kind.
*/
static RT_PTR_ALLOC
-RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
{
RT_PTR_ALLOC allocnode;
size_t allocsize;
- if (inner)
- allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
- else
+ if (is_leaf)
allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
#ifdef RT_SHMEM
allocnode = dsa_allocate(tree->dsa, allocsize);
#else
- if (inner)
- allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
allocsize);
else
- allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
allocsize);
#endif
@@ -1050,12 +1050,12 @@ RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool inner)
/* Initialize the node contents */
static inline void
-RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool inner)
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
{
- if (inner)
- MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
- else
+ if (is_leaf)
MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
node->kind = kind;
@@ -1082,13 +1082,13 @@ static void
RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
{
int shift = RT_KEY_GET_SHIFT(key);
- bool inner = shift > 0;
+ bool is_leaf = shift == 0;
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, inner);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1107,10 +1107,10 @@ RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
*/
static inline RT_PTR_LOCAL
RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
- uint8 new_kind, uint8 new_class, bool inner)
+ uint8 new_kind, uint8 new_class, bool is_leaf)
{
RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, new_kind, new_class, inner);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
RT_COPY_NODE(newnode, node);
return newnode;
@@ -1247,11 +1247,11 @@ RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL
RT_PTR_ALLOC allocchild;
RT_PTR_LOCAL newchild;
int newshift = shift - RT_NODE_SPAN;
- bool inner = newshift > 0;
+ bool is_leaf = newshift == 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, inner);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, inner);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 0fcebf1c6b..22aca0e6cc 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -16,10 +16,10 @@
bool chunk_exists = false;
#ifdef RT_NODE_LEVEL_LEAF
- const bool inner = false;
+ const bool is_leaf = true;
Assert(RT_NODE_IS_LEAF(node));
#else
- const bool inner = true;
+ const bool is_leaf = false;
Assert(!RT_NODE_IS_LEAF(node));
#endif
@@ -52,8 +52,8 @@
const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
/* grow node from 3 to 32 */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
new32 = (RT_NODE32_TYPE *) newnode;
#ifdef RT_NODE_LEVEL_LEAF
@@ -124,7 +124,7 @@
Assert(n32->base.n.fanout == class32_min.fanout);
/* grow to the next size class of this kind */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
n32 = (RT_NODE32_TYPE *) newnode;
@@ -150,8 +150,8 @@
Assert(n32->base.n.fanout == class32_max.fanout);
/* grow node from 32 to 125 */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
new125 = (RT_NODE125_TYPE *) newnode;
for (int i = 0; i < class32_max.fanout; i++)
@@ -229,8 +229,8 @@
const RT_SIZE_CLASS new_class = RT_CLASS_256;
/* grow node from 125 to 256 */
- allocnode = RT_ALLOC_NODE(tree, new_class, inner);
- newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, inner);
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
new256 = (RT_NODE256_TYPE *) newnode;
for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
--
2.39.0
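For anyone skimming the mechanical changes above, the convention this patch settles on fits in a few lines: a node with shift == 0 is a leaf, anything else is an inner node. The snippet below is only a sketch with invented type names and an assumed 8-bit span; the real check is the RT_NODE_IS_LEAF() macro in radixtree.h.

#include <stdbool.h>
#include <stdio.h>

#define NODE_SPAN   8           /* bits consumed per tree level (assumed) */

typedef struct node
{
    int         shift;          /* 0 for leaves, a positive multiple of NODE_SPAN otherwise */
} node;

static inline bool
node_is_leaf(const node *n)
{
    /* mirrors RT_NODE_IS_LEAF(): leaf nodes sit at shift zero */
    return n->shift == 0;
}

int
main(void)
{
    node        leaf = {.shift = 0};
    node        inner = {.shift = 2 * NODE_SPAN};

    printf("first node:  %s\n", node_is_leaf(&leaf) ? "leaf" : "inner");
    printf("second node: %s\n", node_is_leaf(&inner) ? "leaf" : "inner");
    return 0;
}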
Attachment: v22-0020-Do-some-rewriting-and-proofreading-of-comments.patch (text/x-patch; charset=US-ASCII)
From cd7664aea7022902e08d26ef91a1a88421fde3c6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 23 Jan 2023 18:00:20 +0700
Subject: [PATCH v22 20/22] Do some rewriting and proofreading of comments
In passing, change one ternary operator to if/else.
---
src/include/lib/radixtree.h | 160 +++++++++++++++++++++---------------
1 file changed, 92 insertions(+), 68 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 5927437034..7fcd212ea4 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -9,25 +9,38 @@
* types, each with a different numbers of elements. Depending on the number of
* children, the appropriate node type is used.
*
- * There are some differences from the proposed implementation. For instance,
- * there is not support for path compression and lazy path expansion. The radix
- * tree supports fixed length of the key so we don't expect the tree level
- * wouldn't be high.
+ * WIP: notes about traditional radix tree trading off span vs height...
*
- * Both the key and the value are 64-bit unsigned integer. The inner nodes and
- * the leaf nodes have slightly different structure: for inner tree nodes,
- * shift > 0, store the pointer to its child node as the value. The leaf nodes,
- * shift == 0, have the 64-bit unsigned integer that is specified by the user as
- * the value. The paper refers to this technique as "Multi-value leaves". We
- * choose it to avoid an additional pointer traversal. It is the reason this code
- * currently does not support variable-length keys.
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
*
- * XXX: Most functions in this file have two variants for inner nodes and leaf
- * nodes, therefore there are duplication codes. While this sometimes makes the
- * code maintenance tricky, this reduces branch prediction misses when judging
- * whether the node is a inner node of a leaf node.
+ * The ART paper mentions three ways to implement leaves:
*
- * XXX: the radix tree node never be shrunk.
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves"
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * WIP: the radix tree nodes don't shrink.
*
* To generate a radix tree and associated functions for a use case several
* macros have to be #define'ed before this file is included. Including
@@ -42,11 +55,11 @@
* - RT_DEFINE - if defined function definitions are generated
* - RT_SCOPE - in which scope (e.g. extern, static inline) do function
* declarations reside
- * - RT_SHMEM - if defined, the radix tree is created in the DSA area
- * so that multiple processes can access it simultaneously.
* - RT_VALUE_TYPE - the type of the value.
*
* Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
* - RT_DEBUG - if defined add stats tracking and debugging functions
*
* Interface
@@ -54,9 +67,6 @@
*
* RT_CREATE - Create a new, empty radix tree
* RT_FREE - Free the radix tree
- * RT_ATTACH - Attach to the radix tree
- * RT_DETACH - Detach from the radix tree
- * RT_GET_HANDLE - Return the handle of the radix tree
* RT_SEARCH - Search a key-value pair
* RT_SET - Set a key-value pair
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
@@ -64,11 +74,12 @@
* RT_END_ITER - End iteration
* RT_MEMORY_USAGE - Get the memory usage
*
- * RT_CREATE() creates an empty radix tree in the given memory context
- * and memory contexts for all kinds of radix tree node under the memory context.
+ * Interface for Shared Memory
+ * ---------
*
- * RT_ITERATE_NEXT() ensures returning key-value pairs in the ascending
- * order of the key.
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
*
* Optional Interface
* ---------
@@ -360,13 +371,23 @@ typedef struct RT_NODE
#define RT_INVALID_PTR_ALLOC NULL
#endif
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store the
+ * pointer to a child node in the slot. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
#define RT_NODE_MUST_GROW(node) \
((node)->base.n.count == (node)->base.n.fanout)
-/* Base type of each node kinds for leaf and inner nodes */
-/* The base types must be a be able to accommodate the largest size
-class for variable-sized node kinds*/
+/*
+ * Base type of each node kind for leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
typedef struct RT_NODE_BASE_3
{
RT_NODE n;
@@ -384,9 +405,9 @@ typedef struct RT_NODE_BASE_32
} RT_NODE_BASE_32;
/*
- * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length, typically
- * 256, to store indexes into a second array that contains up to 125 values (or
- * child pointers in inner nodes).
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
*/
typedef struct RT_NODE_BASE_125
{
@@ -407,15 +428,8 @@ typedef struct RT_NODE_BASE_256
/*
* Inner and leaf nodes.
*
- * Theres are separate for two main reasons:
- *
- * 1) the value type might be different than something fitting into a pointer
- * width type
- * 2) Need to represent non-existing values in a key-type independent way.
- *
- * 1) is clearly worth being concerned about, but it's not clear 2) is as
- * good. It might be better to just indicate non-existing entries the same way
- * in inner nodes.
+ * These are separate because the value type might be different than
+ * something fitting into a pointer-width type.
*/
typedef struct RT_NODE_INNER_3
{
@@ -466,8 +480,10 @@ typedef struct RT_NODE_LEAF_125
} RT_NODE_LEAF_125;
/*
- * node-256 is the largest node type. This node has RT_NODE_MAX_SLOTS length array
+ * node-256 is the largest node type. This node has an array
* for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
*/
typedef struct RT_NODE_INNER_256
{
@@ -481,7 +497,10 @@ typedef struct RT_NODE_LEAF_256
{
RT_NODE_BASE_256 base;
- /* isset is a bitmap to track which slot is in use */
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slot is in use.
+ */
bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
/* Slots for 256 values */
@@ -570,7 +589,8 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
#define RT_RADIX_TREE_MAGIC 0x54A48167
#endif
-/* A radix tree with nodes */
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
typedef struct RT_RADIX_TREE_CONTROL
{
#ifdef RT_SHMEM
@@ -588,7 +608,7 @@ typedef struct RT_RADIX_TREE_CONTROL
#endif
} RT_RADIX_TREE_CONTROL;
-/* A radix tree with nodes */
+/* Entry point for allocating and accessing the tree */
typedef struct RT_RADIX_TREE
{
MemoryContext context;
@@ -613,15 +633,15 @@ typedef struct RT_RADIX_TREE
* RT_NODE_ITER struct is used to track the iteration within a node.
*
* RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
- * in order to track the iteration of each level. During the iteration, we also
+ * in order to track the iteration of each level. During iteration, we also
* construct the key whenever updating the node iteration information, e.g., when
* advancing the current index within the node or when moving to the next node
* at the same level.
-+ *
-+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
-+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
-+ * We need either a safeguard to disallow other processes to begin the iteration
-+ * while one process is doing or to allow multiple processes to do the iteration.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes to begin the iteration
+ * while one process is doing or to allow multiple processes to do the iteration.
*/
typedef struct RT_NODE_ITER
{
@@ -637,7 +657,7 @@ typedef struct RT_ITER
RT_NODE_ITER stack[RT_MAX_LEVEL];
int stack_len;
- /* The key is being constructed during the iteration */
+ /* The key is constructed during iteration */
uint64 key;
} RT_ITER;
@@ -672,8 +692,8 @@ RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
}
/*
- * Return index of the first element in 'base' that equals 'key'. Return -1
- * if there is no such element.
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
*/
static inline int
RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
@@ -693,7 +713,8 @@ RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
}
/*
- * Return index of the chunk to insert into chunks in the given node.
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
*/
static inline int
RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
@@ -744,7 +765,7 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
/* replicate the search key */
spread_chunk = vector8_broadcast(chunk);
- /* compare to the 32 keys stored in the node */
+ /* compare to all 32 keys stored in the node */
vector8_load(&haystack1, &node->chunks[0]);
vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
cmp1 = vector8_eq(spread_chunk, haystack1);
@@ -768,7 +789,7 @@ RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
}
/*
- * Return index of the node's chunk array to insert into,
+ * Return index of the chunk and slot arrays for inserting into the node,
* such that the chunk array remains ordered.
*/
static inline int
@@ -809,7 +830,7 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
* This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
* no unsigned uint8 comparison instruction exists, at least for SSE2. So
* we need to play some trickery using vector8_min() to effectively get
- * <=. There'll never be any equal elements in the current uses, but that's
+ * <=. There'll never be any equal elements in urrent uses, but that's
* what we get here...
*/
spread_chunk = vector8_broadcast(chunk);
@@ -834,6 +855,7 @@ RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
#endif
}
+
/*
* Functions to manipulate both chunks array and children/values array.
* These are used for node-3 and node-32.
@@ -993,18 +1015,19 @@ RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
}
/*
- * Return the shift that is satisfied to store the given key.
+ * Return the largest shift that will allow storing the given key.
*/
static inline int
RT_KEY_GET_SHIFT(uint64 key)
{
- return (key == 0)
- ? 0
- : (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
}
/*
- * Return the max value stored in a node with the given shift.
+ * Return the max value that can be stored in the tree with the given shift.
*/
static uint64
RT_SHIFT_GET_MAX_VAL(int shift)
@@ -1155,6 +1178,7 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
#endif
}
+/* Update the parent's pointer when growing a node */
static inline void
RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
{
@@ -1182,7 +1206,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
if (parent == old_child)
{
- /* Replace the root node with the new large node */
+ /* Replace the root node with the new larger node */
tree->ctl->root = new_child;
}
else
@@ -1192,8 +1216,8 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
}
/*
- * The radix tree doesn't sufficient height. Extend the radix tree so it can
- * store the key.
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
*/
static void
RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
@@ -1337,7 +1361,7 @@ RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stor
#undef RT_NODE_LEVEL_INNER
}
-/* Like, RT_NODE_INSERT_INNER, but for leaf nodes */
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
static bool
RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_VALUE_TYPE value)
@@ -1377,7 +1401,7 @@ RT_CREATE(MemoryContext ctx)
#else
tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
- /* Create the slab allocator for each size class */
+ /* Create a slab context for each size class */
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
@@ -1570,7 +1594,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
parent = RT_PTR_GET_LOCAL(tree, stored_child);
shift = parent->shift;
- /* Descend the tree until a leaf node */
+ /* Descend the tree until we reach a leaf node */
while (shift >= 0)
{
RT_PTR_ALLOC new_child;
--
2.39.0
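One of the comments rewritten above says RT_KEY_GET_SHIFT() returns the largest shift that still allows storing the given key. The standalone sketch below mirrors that calculation; it assumes an 8-bit span (the value of RT_NODE_SPAN is not visible in these hunks) and substitutes a GCC/Clang builtin for pg_leftmost_one_pos64(), so treat it as an approximation of the logic rather than the patch's code.

#include <stdint.h>
#include <stdio.h>

#define NODE_SPAN   8           /* assumed per-level key span in bits */

/* Largest starting shift that still covers the key's most significant set bit */
static int
key_get_shift(uint64_t key)
{
    int         msb;

    if (key == 0)
        return 0;

    msb = 63 - __builtin_clzll(key);    /* stand-in for pg_leftmost_one_pos64() */
    return (msb / NODE_SPAN) * NODE_SPAN;
}

int
main(void)
{
    /* keys below 256 fit at shift 0; each extra byte pushes the shift up by one span */
    printf("shift for 0xff       = %d\n", key_get_shift(UINT64_C(0xff)));
    printf("shift for 0x1ff      = %d\n", key_get_shift(UINT64_C(0x1ff)));
    printf("shift for 0xffffffff = %d\n", key_get_shift(UINT64_C(0xffffffff)));
    return 0;
}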
Attachment: v22-0022-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch (text/x-patch; charset=US-ASCII)
From cd1cc048b81abbd942a9a7e66b1d64a9a844ac84 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 17 Jan 2023 17:20:37 +0700
Subject: [PATCH v22 22/22] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which is not space efficient and is slow to look up. It also
had a hard 1GB limit on its size.
This commit switches to TIDStore for this purpose. Since TIDStore,
backed by the radix tree, allocates memory incrementally, the 1GB
limit goes away.
Also, since we can no longer estimate in advance exactly how many
TIDs fit in a given amount of memory, the columns max_dead_tuples
and num_dead_tuples are renamed and the progress information is now
reported in bytes.
Furthermore, since TIDStore uses the radix tree internally, the
minimum amount of memory required by TIDStore is 1MB, the initial
DSA segment size. Because of that, this change increases the minimum
maintenance_work_mem from 1MB to 2MB.
XXX: needs to bump catalog version
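As a rough editorial illustration of the flow described above (not part of the patch itself): the TidStore calls below, and their arguments, are the ones used in the hunks that follow, while the wrapper function is invented for illustration and would only compile in a backend tree that has the TIDStore patches applied.

#include "postgres.h"

#include "access/tidstore.h"
#include "storage/itemptr.h"

static void
dead_items_usage_sketch(int vac_work_mem)
{
    TidStore   *dead_items;
    OffsetNumber offsets[2] = {1, 5};
    ItemPointerData tid;
    TidStoreIter *iter;
    TidStoreIterResult *result;

    /* local store; pass a dsa_area * instead of NULL for parallel vacuum */
    dead_items = tidstore_create(vac_work_mem, NULL);

    /* first heap pass: remember the LP_DEAD offsets found on block 10 */
    tidstore_add_tids(dead_items, (BlockNumber) 10, offsets, 2);

    /* index pass: existence check that replaces the old bsearch() */
    ItemPointerSet(&tid, 10, 5);
    if (tidstore_lookup_tid(dead_items, &tid))
        elog(DEBUG2, "TID (10,5) is dead");

    /* second heap pass: visit each block that has dead TIDs */
    iter = tidstore_begin_iterate(dead_items);
    while ((result = tidstore_iterate_next(iter)) != NULL)
        elog(DEBUG2, "block %u has %d dead offsets",
             result->blkno, result->num_offsets);
    tidstore_end_iterate(iter);

    tidstore_destroy(dead_items);
}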
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 210 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 76 +-------
src/backend/commands/vacuumparallel.c | 64 ++++---
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
15 files changed, 138 insertions(+), 268 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d936aa3da3..0230c74e3d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6870,10 +6870,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6881,10 +6881,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..90f8a5e087 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,17 +221,21 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected LP_DEAD items including existing LP_DEAD items */
+ int lpdead_items;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies !HAS_LPDEAD_ITEMS(), but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
bool all_frozen; /* provided all_visible is also true */
TransactionId visibility_cutoff_xid; /* For recovery conflicts */
} LVPagePruneState;
+#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)
/* Struct for saving and restoring vacuum error information. */
typedef struct LVSavedErrInfo
@@ -259,8 +264,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +831,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +912,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1018,7 +1023,7 @@ lazy_scan_heap(LVRelState *vacrel)
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || !HAS_LPDEAD_ITEMS(prunestate));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1039,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (HAS_LPDEAD_ITEMS(prunestate))
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.lpdead_items, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1081,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (HAS_LPDEAD_ITEMS(prunestate))
+ {
+ /* Save details of the LP_DEAD items from the page */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.lpdead_items);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1157,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if (HAS_LPDEAD_ITEMS(prunestate) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1205,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if (HAS_LPDEAD_ITEMS(prunestate) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1261,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1543,13 +1555,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1581,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1589,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->lpdead_items; prunestate->lpdead_items's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1602,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->lpdead_items = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1647,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->lpdead_items++] = offnum;
continue;
}
@@ -1875,7 +1884,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->lpdead_items == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1897,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->lpdead_items > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1918,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->lpdead_items;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -2129,8 +2119,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2128,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2180,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2209,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2236,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2282,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2355,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2410,10 +2392,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2411,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2421,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2435,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2446,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,14 +2456,13 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2495,11 +2480,10 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets, Buffer buffer, Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2502,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = offsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2576,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3093,46 +3071,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3143,11 +3081,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3110,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3123,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..a526e607fe 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127e..358ad25996 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,18 +2342,6 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
@@ -2365,60 +2352,7 @@ vac_max_items_to_alloc_size(int max_items)
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..4c0ce4b7e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -44,7 +44,7 @@
* use small integers.
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 3
#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 4
#define PARALLEL_VACUUM_KEY_WAL_USAGE 5
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DSA, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DSA, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4ac808ed22..422914f0a9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2312,7 +2312,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..220d89fff7 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e7a2f5856a..f6ae02eb14 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.39.0
Attachment: v22-0021-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (text/x-patch)
From 777bc2d7c18cba89122e581962634696e72ada56 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v22 21/22] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by a radix tree. A TID is encoded into a 64-bit key and a
64-bit value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 626 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 189 ++++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 965 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1756f1a4b6..d936aa3da3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2192,6 +2192,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..26e3077b5e
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,626 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a Tid is encoded as a pair of a 64-bit key and a 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA area
+ * to tidstore_create(). Other backends can attach to the shared TidStore with
+ * tidstore_attach(). It supports concurrent updates, but only one process
+ * is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, item pointers are represented as a pair of 64-bit
+ * key and 64-bit value. First, we construct a 64-bit unsigned integer key that
+ * combines the block number and the offset number. The lowest 11 bits represent
+ * the offset number, and the next 32 bits are the block number. That is, only 43
+ * bits are used:
+ *
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ *
+ * 11 bits enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with
+ * the bits, because smaller keys could help keeping the radix tree shallow.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits, and
+ * the remaining 37 bits are used as the key:
+ *
+ * value = bitmap representation of XXXXXX
+ * key = XXXXXYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYuu
+ *
+ * The maximum height of the radix tree is 5.
+ *
+ * XXX: if we want to support non-heap table AM that want to use the full
+ * range of possible offset numbers, we'll need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.
+ */
+#define TIDSTORE_OFFSET_NBITS 11
+#define TIDSTORE_VALUE_NBITS 6
+
+/*
+ * Memory consumption depends not only on the number of Tids stored, but also
+ * on their distribution, on how the radix tree stores them, and on the memory
+ * management that backs the radix tree. The maximum number of bytes that a
+ * TidStore can use is specified by max_bytes in tidstore_create(). We want
+ * the total memory consumption not to exceed max_bytes.
+ *
+ * In non-shared cases, the radix tree uses slab allocators for each kind of
+ * node class. The most memory-consuming case while adding Tids associated
+ * with one page (i.e., during tidstore_add_tids()) is that we allocate the
+ * largest radix tree node in a new slab block, which is approximately 70kB.
+ * Therefore, we deduct 70kB from the maximum bytes.
+ *
+ * In shared cases, DSA allocates memory segments big enough to follow
+ * a geometric series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases the segment
+ * size, and the simulation showed that a 75% threshold for the maximum bytes
+ * works well when max_bytes is a power of two, and a 60% threshold works
+ * for other cases.
+ */
+#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6
+
+#define KEY_GET_BLKNO(key) \
+ ((BlockNumber) ((key) >> (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+#define BLKNO_GET_KEY(blkno) \
+ (((uint64) (blkno) << (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS)))
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ /*
+ * 'num_tids' is the number of Tids stored so far. 'max_bytes' is the maximum
+ * bytes a TidStore can use. These two fields are used in both the
+ * non-shared case and the shared case.
+ */
+ uint64 num_tids;
+ uint64 max_bytes;
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+
+ /* protect the shared fields */
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline uint64 tid_to_key_off(ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(uint64 max_bytes, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0)
+ ? TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2
+ : TIDSTORE_SHARED_MAX_MEMORY_RATIO;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes =(uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backend must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (TidStoreIsShared(ts))
+ {
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ {
+ /*
+ * Since the shared radix tree supports concurrent insert,
+ * we don't need to acquire the lock.
+ */
+ shared_rt_set(ts->tree.shared, key, val);
+ }
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+#define NUM_KEYS_PER_BLOCK (1 << (TIDSTORE_OFFSET_NBITS - TIDSTORE_VALUE_NBITS))
+ ItemPointerData tid;
+ uint64 key_base;
+ uint64 values[NUM_KEYS_PER_BLOCK] = {0};
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+ key_base = BLKNO_GET_KEY(blkno);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 key;
+ uint32 off;
+ int idx;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ /* encode the Tid to key and val */
+ key = tid_to_key_off(&tid, &off);
+
+ idx = key - key_base;
+ Assert(idx >= 0 && idx < NUM_KEYS_PER_BLOCK);
+
+ values[idx] |= UINT64CONST(1) << off;
+ }
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i < NUM_KEYS_PER_BLOCK; i++)
+ {
+ if (values[i])
+ {
+ uint64 key = key_base + i;
+
+ tidstore_insert_kv(ts, key, values[i]);
+ }
+ }
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+}
+
+/* Return true if the given Tid is present in TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(tid, &off);
+
+ found = TidStoreIsShared(ts) ?
+ shared_rt_search(ts->tree.shared, key, &val) :
+ local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+ iter->result.blkno = InvalidBlockNumber;
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = KEY_GET_BLKNO(key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+uint64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /* in the local case there is no lock to take */
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+uint64
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return (uint64) sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return (uint64) sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << TIDSTORE_OFFSET_NBITS) - 1);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = KEY_GET_BLKNO(key);
+}
+
+/*
+ * Encode a Tid to key and val.
+ */
+static inline uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint64 tid_i;
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << TIDSTORE_OFFSET_NBITS;
+
+ *off = tid_i & ((1 << TIDSTORE_VALUE_NBITS) - 1);
+ upper = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return upper;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..ec3d9f87f5
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber offsets[MaxOffsetNumber]; /* XXX: usually don't use up */
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(uint64 max_bytes, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern uint64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern uint64 tidstore_max_memory(TidStore *ts);
+extern uint64 tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..5d38387450
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,189 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(void)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 11
+#define IS_POWER_OF_TWO(x) (((x) & ((x) - 1)) == 0)
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS] = {
+ 1 << 5, 1 << 6, 1 << 7, 1 << 8, 1 << 9,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4,
+ 1 << 10
+ };
+ OffsetNumber offs_sorted[TEST_TIDSTORE_NUM_OFFSETS] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3, 1 << 4,
+ 1 << 5, 1 << 6, 1 << 7, 1 << 8, 1 << 9,
+ 1 << 10
+ };
+ int blk_idx;
+
+ elog(NOTICE, "testing basic operations");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, IS_POWER_OF_TWO(off));
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, IS_POWER_OF_TWO(off));
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs_sorted[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno,
+ offs_sorted[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+ test_basic();
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.0
On Mon, Jan 23, 2023 at 6:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Attached is a rebase to fix conflicts from recent commits.
I have reviewed v22-0022* patch and I have some comments.
1.
It also changes to the column names max_dead_tuples and num_dead_tuples and to
show the progress information in bytes.
I think this statement needs to be rephrased.
2.
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
*
* Assumes dead_items array is sorted (in ascending TID order).
*/
I think this comment 'Assumes dead_items array is sorted' is not valid anymore.
3.
We are changing the min value of 'maintenance_work_mem' to 2MB. Should
we do the same for the 'autovacuum_work_mem'?
4.
+
+ /* collected LP_DEAD items including existing LP_DEAD items */
+ int lpdead_items;
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
We are actually collecting dead offsets but the variable name says
'lpdead_items' instead of something like ndeadoffsets or num_deadoffsets.
And the comment also says dead items.
5.
/*
* lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
* vacrel->dead_items array.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
*
* index is an offset into the vacrel->dead_items array for the first listed
* LP_DEAD item on the page. The return value is the first index immediately
* after all LP_DEAD items for the same page in the array.
*/
This comment needs to be changed as this is referring to the
'vacrel->dead_items array' which no longer exists.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jan 23, 2023 at 8:20 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
In v21, all of your v20 improvements to the radix tree template and test have been squashed into 0003, with one exception: v20-0010 (recursive freeing of shared mem), which I've attached separately (for flexibility) as v21-0006. I believe one of your earlier patches had a new DSA function for freeing memory more quickly -- was there a problem with that approach? I don't recall where that discussion went.
Hmm, I don't remember I proposed such a patch, either.
One idea to address it would be that we pass a shared memory to
RT_CREATE() and we create a DSA area dedicated to the radix tree in
place. We should return the created DSA area along with the radix tree
so that the caller can use it (e.g., for dsa_get_handle(), dsa_pin(),
and dsa_pin_mapping() etc). In RT_FREE(), we just detach from the DSA
area. A downside of this idea would be that one DSA area only for a
radix tree is always required.
Another idea would be that we allocate a big enough DSA area and
quarry small memory for nodes from there. But it would need to
introduce another complexity so I prefer to avoid it.
FYI the current design is inspired by dshash.c. In dshash_destroy(),
we dsa_free() each element allocated by dshash.c.
+ * XXX: Most functions in this file have two variants for inner nodes and leaf
+ * nodes, therefore there are duplication codes. While this sometimes makes the
+ * code maintenance tricky, this reduces branch prediction misses when judging
+ * whether the node is a inner node of a leaf node.

This comment seems to be out-of-date since we made it a template.
Done in 0020, along with a bunch of other comment editing.
The following macros are defined but not undefined in radixtree.h:
Fixed in v21-0018.
Also:
0007 makes the value type configurable. Some debug functionality still assumes integer type, but I think the rest is agnostic.
radixtree_search_impl.h still assumes that the value type is an
integer type as follows:
#ifdef RT_NODE_LEVEL_LEAF
RT_VALUE_TYPE value = 0;
Assert(RT_NODE_IS_LEAF(node));
#else
Also, I think if we make the value type configurable, it's better to
pass the pointer of the value to RT_SET() instead of copying the
values since the value size could be large.
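In other words, something along these lines (only an illustration; I'm hand-waving the scope and return type here):

    RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)

so that a large value type is passed by reference rather than copied on every call.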
0010 turns node4 into node3, as discussed, going from 48 bytes to 32.
0012 adopts the benchmark module to the template, and adds meson support (builds with warnings, but okay because not meant for commit).

The rest are cleanups, small refactorings, and more comment rewrites. I've kept them separate for visibility. Next patch can squash them unless there is any discussion.
0008 patch
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- fprintf(stderr, "%s\tinner_size %zu\tinner_blocksize
%zu\tleaf_size %zu\tleaf_blocksize %zu\n",
+ fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\t%zu\n",
RT_SIZE_CLASS_INFO[i].name,
RT_SIZE_CLASS_INFO[i].inner_size,
- RT_SIZE_CLASS_INFO[i].inner_blocksize,
- RT_SIZE_CLASS_INFO[i].leaf_size,
- RT_SIZE_CLASS_INFO[i].leaf_blocksize);
+ RT_SIZE_CLASS_INFO[i].leaf_size);
There is an additional '%zu' at the end of the format string.
---
0011 patch
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statments.
typo: s/statments/statements/
The rest look good to me. I'll incorporate these fixes in the next
version patch.
uint32 is how we store the block number, so this is too small and will wrap around on overflow. int64 seems better.
Agreed, will fix.
Great, but it's now uint64, not int64. All the large counters in struct LVRelState, for example, are signed integers, as the usual practice. Unsigned ints are "usually" for things like bit patterns and where explicit wraparound is desired. There's probably more that can be done here to change to signed types, but I think it's still a bit early to get to that level of nitpicking. (Soon, I hope :-) )
Agreed. I'll change it in the next version patch.
+ * We calculate the maximum bytes for the TidStore in different ways
+ * for non-shared case and shared case. Please refer to the comment
+ * TIDSTORE_MEMORY_DEDUCT for details.
+ */

Maybe the #define and comment should be close to here.
Will fix.
For this, I intended that "here" meant "in or just above the function".
+#define TIDSTORE_LOCAL_MAX_MEMORY_DEDUCT (1024L * 70) /* 70kB */
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO_PO2 (float) 0.75
+#define TIDSTORE_SHARED_MAX_MEMORY_RATIO (float) 0.6

These symbols are used only once, in tidstore_create(), and are difficult to read. That function has few comments. The symbols have several paragraphs, but they are far away. It might be better for readability to just hard-code numbers in the function, with the explanation about the numbers near where they are used.
Agreed, will fix.
+ * Destroy a TidStore, returning all memory. The caller must be certain that
+ * no other backend will attempt to access the TidStore before calling this
+ * function. Other backend must explicitly call tidstore_detach to free up
+ * backend-local memory associated with the TidStore. The backend that calls
+ * tidstore_destroy must not call tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)

If not addressed by next patch, need to phrase comment with FIXME or TODO about making certain.
Will fix.
Did anything change here?
Oops, the fix is missed in the patch for some reason. I'll fix it.
There is also this, in the template, which I'm not sure has been addressed:
* XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
* has the local pointers to nodes, rather than RT_PTR_ALLOC.
* We need either a safeguard to disallow other processes to begin the iteration
* while one process is doing or to allow multiple processes to do the iteration.
It's not addressed yet. I think adding a safeguard is better for the
first version. A simple solution is to add a flag, say iter_active, to
allow only one process to enable the iteration. What do you think?
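To be concrete, I'm thinking of something like the following in tidstore_begin_iterate() (a rough sketch; the iter_active field in TidStoreControl and the error message are only illustrative):

    if (TidStoreIsShared(ts))
    {
        LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
        if (ts->control->iter_active)
        {
            LWLockRelease(&ts->control->lock);
            elog(ERROR, "concurrent iteration over a shared TidStore is not supported");
        }
        ts->control->iter_active = true;
        LWLockRelease(&ts->control->lock);
    }

and tidstore_end_iterate() would clear the flag again under the lock.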
This part only runs "if (vacrel->nindexes == 0)", so seems like unneeded complexity. It arises because lazy_scan_prune() populates the tid store even if no index vacuuming happens. Perhaps the caller of lazy_scan_prune() could pass the deadoffsets array, and upon returning, either populate the store or call lazy_vacuum_heap_page(), as needed. It's quite possible I'm missing some detail, so some description of the design choices made would be helpful.
I agree that we don't need complexity here. I'll try this idea.
Keeping the offsets array in the prunestate seems to work out well.
Some other quick comments on tid store and vacuum, not comprehensive. Let me know if I've misunderstood something:
TID store:
+ * XXXXXXXX XXXYYYYY YYYYYYYY YYYYYYYY YYYYYYYY YYYuuuu
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit

I was confused for a while, and I realized the bits are in reverse order from how they are usually pictured (high on left, low on the right).
I borrowed it from ginpostinglist.c but it seems better to write it in
the common order.
+ * 11 bits enough for the offset number, because MaxHeapTuplesPerPage < 2^11
+ * on all supported block sizes (TIDSTORE_OFFSET_NBITS). We are frugal with

+ * XXX: if we want to support non-heap table AM that want to use the full
+ * range of possible offset numbers, we'll need to reconsider
+ * TIDSTORE_OFFSET_NBITS value.

Would it be worth it (or possible) to calculate constants based on compile-time block size? And/or have a fallback for other table AMs? Since this file is in access/common, the intention is to allow general-purpose, I imagine.
I think we can pass the maximum offset numbers to tidstore_create()
and calculate these values.
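For example (a rough sketch; the max_offset parameter and the offset_nbits field don't exist yet and are only illustrative):

    TidStore *
    tidstore_create(uint64 max_bytes, OffsetNumber max_offset, dsa_area *area)
    {
        TidStore   *ts = palloc0(sizeof(TidStore));

        /* bits needed to represent offset numbers up to max_offset */
        ts->offset_nbits = pg_ceil_log2_32(max_offset + 1);
        ...
    }

The key encoding would then use the per-store offset_nbits instead of the compile-time TIDSTORE_OFFSET_NBITS.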
+typedef dsa_pointer tidstore_handle;
It's not clear why we need a typedef here, since here:
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
...
+ control = handle;

...there is a differently-named dsa_pointer variable that just gets the function parameter.
I guess one reason is to improve compatibility; we can stash the
actual value of the handle, which could help some cases, for example,
when we need to change the actual value of the handle. dshash.c uses
the same idea. Another reason would be to improve readability.
+/* Return the maximum memory TidStore can use */
+uint64
+tidstore_max_memory(TidStore *ts)

size_t is more suitable for memory.
Will fix.
+ /*
+ * Since the shared radix tree supports concurrent insert,
+ * we don't need to acquire the lock.
+ */

Hmm? IIUC, the caller only acquires the lock after returning from here, to update statistics. Why is it safe to insert with no lock? Am I missing something?
You're right. I was missing something. The lock should be taken before
adding key-value pairs.
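I'm thinking of fixing it roughly like this in tidstore_add_tids(), i.e. acquiring the lock before the inserts rather than only around the statistics update (a sketch based on the v22 code):

    if (TidStoreIsShared(ts))
        LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);

    /* insert the calculated key-values to the tree */
    for (int i = 0; i < NUM_KEYS_PER_BLOCK; i++)
    {
        if (values[i])
            tidstore_insert_kv(ts, key_base + i, values[i]);
    }

    /* update statistics */
    ts->control->num_tids += num_offsets;

    if (TidStoreIsShared(ts))
        LWLockRelease(&ts->control->lock);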
VACUUM integration:
-#define PARALLEL_VACUUM_KEY_DEAD_ITEMS 2
+#define PARALLEL_VACUUM_KEY_DSA 2

Seems like unnecessary churn? It is still all about dead items, after all. I understand using "DSA" for the LWLock, since that matches surrounding code.
Agreed, will remove.
+#define HAS_LPDEAD_ITEMS(state) (((state).lpdead_items) > 0)
This macro helps the patch readability in some places, but I'm not sure it helps readability of the file as a whole. The following is in the patch and seems perfectly clear without the macro:
- if (lpdead_items > 0)
+ if (prunestate->lpdead_items > 0)
Will remove the macro.
About shared memory: I have some mild reservations about the naming of the "control object", which may be in shared memory. Is that an established term? (If so, disregard the rest): It seems backwards -- the thing in shared memory is the actual tree itself. The thing in backend-local memory has the "handle", and that's how we control the tree. I don't have a better naming scheme, though, and might not be that important. (Added a WIP comment)
That seems a valid concern. I borrowed the "control object" from
dshash.c but it supports only shared cases. The fact that the radix
tree supports both local and shared seems to introduce this confusion.
I came up with other names such as RT_RADIX_TREE_CORE or
RT_RADIX_TREE_ROOT but not sure these are better than the current
one.
Now might be a good time to look at earlier XXX comments and come up with a plan to address them.
Agreed.
Other XXX comments that are not mentioned yet are:
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
I'm not sure we really need memory context support for RT_ATTACH()
since in the shared case, we allocate backend-local memory only for
RT_RADIX_TREE.
---
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ // XXX is this necessary?
+ Size total = sizeof(RT_RADIX_TREE);
Regarding this, I followed intset_memory_usage(). But in the radix
tree, RT_RADIX_TREE is very small so probably we can ignore it.
---
+/* XXX For display, assumes value type is numeric */
+static void
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
I think we can display values in hex encoded format but given the
value could be large, we don't necessarily need to display actual
values. Displaying the tree structure and chunks would be helpful for
debugging the radix tree.
---
There is no XXX comment but I'll try to add lock support in the next
version patch.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Jan 25, 2023 at 8:42 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Jan 23, 2023 at 8:20 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:In v21, all of your v20 improvements to the radix tree template and
test have been squashed into 0003, with one exception: v20-0010 (recursive
freeing of shared mem), which I've attached separately (for flexibility) as
v21-0006. I believe one of your earlier patches had a new DSA function for
freeing memory more quickly -- was there a problem with that approach? I
don't recall where that discussion went.
Hmm, I don't remember I proposed such a patch, either.
I went looking, and it turns out I remembered wrong, sorry.
One idea to address it would be that we pass a shared memory to
RT_CREATE() and we create a DSA area dedicated to the radix tree in
place. We should return the created DSA area along with the radix tree
so that the caller can use it (e.g., for dsa_get_handle(), dsa_pin(),
and dsa_pin_mapping() etc). In RT_FREE(), we just detach from the DSA
area. A downside of this idea would be that one DSA area only for a
radix tree is always required.Another idea would be that we allocate a big enough DSA area and
quarry small memory for nodes from there. But it would need to
introduce another complexity so I prefer to avoid it.FYI the current design is inspired by dshash.c. In dshash_destory(),
we dsa_free() each elements allocated by dshash.c
Okay, thanks for the info.
0007 makes the value type configurable. Some debug functionality still
assumes integer type, but I think the rest is agnostic.
radixtree_search_impl.h still assumes that the value type is an
integer type as follows:

#ifdef RT_NODE_LEVEL_LEAF
RT_VALUE_TYPE value = 0;

Assert(RT_NODE_IS_LEAF(node));
#else

Also, I think if we make the value type configurable, it's better to
pass the pointer of the value to RT_SET() instead of copying the
values since the value size could be large.
Thanks, I will remove the assignment and look into pass-by-reference.
Oops, the fix is missed in the patch for some reason. I'll fix it.
There is also this, in the template, which I'm not sure has been
addressed:
* XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
* has the local pointers to nodes, rather than RT_PTR_ALLOC.
* We need either a safeguard to disallow other processes to begin the iteration
* while one process is doing or to allow multiple processes to do the iteration.
It's not addressed yet. I think adding a safeguard is better for the
first version. A simple solution is to add a flag, say iter_active, to
allow only one process to enable the iteration. What do you think?
I don't quite have enough info to offer an opinion, but this sounds like a
different form of locking. I'm sure it's come up before, but could you
describe why iteration is different from other operations, regarding
concurrency?
Would it be worth it (or possible) to calculate constants based on
compile-time block size? And/or have a fallback for other table AMs? Since
this file is in access/common, the intention is to allow general-purpose, I
imagine.
I think we can pass the maximum offset numbers to tidstore_create()
and calculate these values.
That would work easily for vacuumlazy.c, since it's in the "heap" subdir so
we know the max possible offset. I haven't looked at vacuumparallel.c, but
I can tell it is not in a heap-specific directory, so I don't know how easy
that would be to pass along the right value.
About shared memory: I have some mild reservations about the naming of
the "control object", which may be in shared memory. Is that an established
term? (If so, disregard the rest): It seems backwards -- the thing in
shared memory is the actual tree itself. The thing in backend-local memory
has the "handle", and that's how we control the tree. I don't have a better
naming scheme, though, and might not be that important. (Added a WIP
comment)
That seems a valid concern. I borrowed the "control object" from
dshash.c but it supports only shared cases. The fact that the radix
tree supports both local and shared seems to introduce this confusion.
I came up with other names such as RT_RADIX_TREE_CORE or
RT_RADIX_TREE_ROOT but not sure these are better than the current
one.
Okay, if dshash uses it, we have some precedent.
Now might be a good time to look at earlier XXX comments and come up
with a plan to address them.
Agreed.
Other XXX comments that are not mentioned yet are:
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));

I'm not sure we really need memory context support for RT_ATTACH()
since in the shared case, we allocate backend-local memory only for
RT_RADIX_TREE.
Okay, we can remove this.
---

+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+   // XXX is this necessary?
+   Size total = sizeof(RT_RADIX_TREE);

Regarding this, I followed intset_memory_usage(). But in the radix
tree, RT_RADIX_TREE is very small so probably we can ignore it.
That was more a note to myself that I forgot about, so here is my
reasoning: In the shared case, we just overwrite that initial total, but
for the local case we add to it. A future reader could think this is
inconsistent and needs to be fixed. Since we deduct from the guc limit to
guard against worst-case re-allocation, and that deduction is not very
precise (nor needs to be), I agree we should just forget about tiny sizes
like this in both cases.
---

+/* XXX For display, assumes value type is numeric */
+static void
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)

I think we can display values in hex encoded format but given the
value could be large, we don't necessarily need to display actual
values. Displaying the tree structure and chunks would be helpful for
debugging the radix tree.
Okay, I can try that unless you do it first.
There is no XXX comment but I'll try to add lock support in the next
version patch.
Since there were calls to LWLockAcquire/Release in the last version, I'm a
bit confused by this. Perhaps for the next patch, the email should contain
a few sentences describing how locking is intended to work, including for
iteration.
Hmm, I wonder if we need to use the isolation tester. It's both a blessing
and a curse that the first client of this data structure is tid lookup.
It's a blessing because it doesn't present a highly-concurrent workload
mixing reads and writes and so simple locking is adequate. It's a curse
because to test locking and have any chance of finding bugs, we can't rely
on vacuum to tell us that because (as you've said) it might very well work
fine with no locking at all. So we must come up with test cases ourselves.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Jan 24, 2023 at 1:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Jan 23, 2023 at 6:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

Attached is a rebase to fix conflicts from recent commits.
I have reviewed the v22-0022* patch and I have some comments.
1.
It also changes to the column names max_dead_tuples and num_dead_tuples
and to
show the progress information in bytes.
I think this statement needs to be rephrased.
Could you be more specific?
3.
We are changing the min value of 'maintenance_work_mem' to 2MB. Should
we do the same for the 'autovacuum_work_mem'?
Yes, we should change that, too. We've discussed previously that
autovacuum_work_mem is possibly rendered unnecessary by this work, but
we agreed that that should be a separate thread, and it needs
additional testing to verify.
I agree with your other comments.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Jan 26, 2023 at 3:54 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Jan 25, 2023 at 8:42 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Jan 23, 2023 at 8:20 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

On Mon, Jan 16, 2023 at 3:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Jan 16, 2023 at 2:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

In v21, all of your v20 improvements to the radix tree template and test have been squashed into 0003, with one exception: v20-0010 (recursive freeing of shared mem), which I've attached separately (for flexibility) as v21-0006. I believe one of your earlier patches had a new DSA function for freeing memory more quickly -- was there a problem with that approach? I don't recall where that discussion went.
Hmm, I don't remember proposing such a patch, either.
I went looking, and it turns out I remembered wrong, sorry.
One idea to address it would be that we pass a shared memory to
RT_CREATE() and we create a DSA area dedicated to the radix tree in
place. We should return the created DSA area along with the radix tree
so that the caller can use it (e.g., for dsa_get_handle(), dsa_pin(),
and dsa_pin_mapping() etc). In RT_FREE(), we just detach from the DSA
area. A downside of this idea would be that one DSA area only for a
radix tree is always required.

Another idea would be to allocate a big enough DSA area and carve out
small chunks of memory for nodes from there. But it would introduce
another layer of complexity, so I prefer to avoid it.

FYI the current design is inspired by dshash.c. In dshash_destroy(),
we dsa_free() each element allocated by dshash.c.

Okay, thanks for the info.
0007 makes the value type configurable. Some debug functionality still assumes integer type, but I think the rest is agnostic.
radixtree_search_impl.h still assumes that the value type is an
integer type, as follows:

#ifdef RT_NODE_LEVEL_LEAF
    RT_VALUE_TYPE value = 0;

    Assert(RT_NODE_IS_LEAF(node));
#else

Also, I think if we make the value type configurable, it's better to
pass a pointer to the value to RT_SET() instead of copying the value,
since the value size could be large.

Thanks, I will remove the assignment and look into pass-by-reference.

Oops, the fix is missing from the patch for some reason. I'll fix it.
There is also this, in the template, which I'm not sure has been addressed:
* XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
* has the local pointers to nodes, rather than RT_PTR_ALLOC.
* We need either a safeguard to disallow other processes to begin the iteration
* while one process is doing or to allow multiple processes to do the iteration.

It's not addressed yet. I think adding a safeguard is better for the first version. A simple solution is to add a flag, say iter_active, so that only one process can run an iteration at a time. What do you think?

I don't quite have enough info to offer an opinion, but this sounds like a different form of locking. I'm sure it's come up before, but could you describe why iteration is different from other operations, regarding concurrency?
I think that we need to prevent concurrent updates (RT_SET() and
RT_DELETE()) during the iteration in order to get a consistent result
over the whole iteration. Unlike other operations such as
RT_SET(), we cannot expect that a job doing something for each
key-value pair in the radix tree completes in a short time, so we
cannot keep holding the radix tree lock until the end of the
iteration. So the idea is that we set iter_active to true (with the
lock in exclusive mode), and prevent concurrent updates when the flag
is true.
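
To make the intended usage concrete, here is a minimal sketch of the calling pattern, assuming the template's default rt_* names and a uint64 value type (the loop body is purely illustrative):

static void
scan_all_keys(rt_radix_tree *tree)
{
    rt_iter    *iter;
    uint64      key;
    uint64      value;

    /* rt_begin_iterate() sets iter_active while briefly holding the lock */
    iter = rt_begin_iterate(tree);

    while (rt_iterate_next(iter, &key, &value))
    {
        /*
         * Arbitrarily long per-key work happens here with no lock held.
         * Concurrent rt_set()/rt_delete() calls error out as long as
         * iter_active remains set.
         */
    }

    /* rt_end_iterate() clears iter_active again */
    rt_end_iterate(iter);
}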
Would it be worth it (or possible) to calculate constants based on compile-time block size? And/or have a fallback for other table AMs? Since this file is in access/common, the intention is to allow general-purpose use, I imagine.
I think we can pass the maximum offset number to tidstore_create() and calculate these values.

That would work easily for vacuumlazy.c, since it's in the "heap" subdir so we know the max possible offset. I haven't looked at vacuumparallel.c, but I can tell it is not in a heap-specific directory, so I don't know how easy that would be to pass along the right value.
I think the user (e.g., vacuumlazy.c) can pass the maximum offset
number to the parallel vacuum.
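
As a strawman, here is a hedged sketch of a tidstore_create() that takes the caller's maximum offset number; the signature mirrors the tidstore_create(vac_work_mem, MaxHeapTuplesPerPage, NULL) call in the attached 0018 patch, but the struct fields and the derived constant are purely illustrative:

#include "postgres.h"

#include "port/pg_bitutils.h"
#include "utils/dsa.h"

/* Illustrative layout only; the real TidStore will differ. */
typedef struct TidStore
{
    size_t      max_bytes;      /* memory limit given by the caller */
    int         max_offset;     /* largest offset number the table AM uses */
    int         offset_nbits;   /* bits needed to encode one offset */
    dsa_area   *dsa;            /* NULL for backend-local storage */
} TidStore;

TidStore *
tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa)
{
    TidStore   *ts = palloc0(sizeof(TidStore));

    ts->max_bytes = max_bytes;
    ts->max_offset = max_offset;

    /*
     * Derive the encoding width from the caller-supplied maximum offset
     * rather than hard-coding MaxHeapTuplesPerPage, so other table AMs
     * could use the store with their own page geometry.
     */
    ts->offset_nbits = pg_ceil_log2_32((uint32) (max_offset + 1));
    ts->dsa = dsa;

    /* The underlying radix tree would be created here. */

    return ts;
}

vacuumparallel.c then only needs to forward the value it receives from its caller, which is what the attached patch does with MaxHeapTuplesPerPage.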
About shared memory: I have some mild reservations about the naming of the "control object", which may be in shared memory. Is that an established term? (If so, disregard the rest): It seems backwards -- the thing in shared memory is the actual tree itself. The thing in backend-local memory has the "handle", and that's how we control the tree. I don't have a better naming scheme, though, and might not be that important. (Added a WIP comment)
That seems a valid concern. I borrowed the "control object" from
dshash.c but it supports only shared cases. The fact that the radix
tree supports both local and shared seems to introduce this confusion.
I came up with other names such as RT_RADIX_TREE_CORE or
RT_RADIX_TREE_ROOT but not sure these are better than the current
one.

Okay, if dshash uses it, we have some precedent.
Now might be a good time to look at earlier XXX comments and come up with a plan to address them.
Agreed.
Other XXX comments that are not mentioned yet are:
+   /* XXX: memory context support */
+   tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));

I'm not sure we really need memory context support for RT_ATTACH() since in the shared case, we allocate backend-local memory only for RT_RADIX_TREE.

Okay, we can remove this.
---

+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+   // XXX is this necessary?
+   Size total = sizeof(RT_RADIX_TREE);

Regarding this, I followed intset_memory_usage(). But in the radix tree, RT_RADIX_TREE is very small so probably we can ignore it.

That was more a note to myself that I forgot about, so here is my reasoning: In the shared case, we just overwrite that initial total, but for the local case we add to it. A future reader could think this is inconsistent and needs to be fixed. Since we deduct from the guc limit to guard against worst-case re-allocation, and that deduction is not very precise (nor needs to be), I agree we should just forget about tiny sizes like this in both cases.
Thanks for your explanation, agreed.
---

+/* XXX For display, assumes value type is numeric */
+static void
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)

I think we can display values in hex encoded format but given the value could be large, we don't necessarily need to display actual values. Displaying the tree structure and chunks would be helpful for debugging the radix tree.

Okay, I can try that unless you do it first.
There is no XXX comment but I'll try to add lock support in the next
version patch.

Since there were calls to LWLockAcquire/Release in the last version, I'm a bit confused by this. Perhaps for the next patch, the email should contain a few sentences describing how locking is intended to work, including for iteration.
The lock I'm thinking of adding is a simple readers-writer lock. This
lock is used for concurrent radix tree operations other than the
iteration. For operations concurrent with an iteration, I use the flag,
for the reason I mentioned above.
Hmm, I wonder if we need to use the isolation tester. It's both a blessing and a curse that the first client of this data structure is tid lookup. It's a blessing because it doesn't present a highly-concurrent workload mixing reads and writes and so simple locking is adequate. It's a curse because to test locking and have any chance of finding bugs, we can't rely on vacuum to tell us that because (as you've said) it might very well work fine with no locking at all. So we must come up with test cases ourselves.
Using the isolation tester to test locking seems like a good idea. We
can include it in test_radixtree. But given that the locking in the
radix tree is very simple, the test case would be very simple too. It
may be controversial whether it's worth adding both a new test module
and test cases for such simple locking.
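
If we do add something there, a sketch of such a test could be as small as the following; it assumes the module's existing rt_* instantiation, and catching the error with a bare PG_TRY is only for illustration (a real test would want a subtransaction or an error-expecting SQL-level test):

static void
test_iteration_guard(void)
{
    rt_radix_tree *radixtree;
    rt_iter    *iter;
    bool        failed = false;

    radixtree = rt_create(CurrentMemoryContext);
    rt_set(radixtree, 1, 10);

    /* Begin an iteration; this should set iter_active. */
    iter = rt_begin_iterate(radixtree);

    PG_TRY();
    {
        /* Must be rejected while the iteration is in progress. */
        rt_set(radixtree, 2, 20);
    }
    PG_CATCH();
    {
        failed = true;
        FlushErrorState();
    }
    PG_END_TRY();

    if (!failed)
        elog(ERROR, "updating the radix tree during iteration unexpectedly succeeded");

    rt_end_iterate(iter);
    rt_free(radixtree);
}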
I'm working on the fixes I mentioned in the previous email and going
to share the updated patch today. Please hold off on these fixes, if
that's okay with you.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 26, 2023 at 5:32 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I'm working on the fixes I mentioned in the previous email and going
to share the updated patch today. Please hold off on these fixes, if
that's okay with you.
I've attached updated patches. As we agreed, I've merged your changes
from v22 into the main (0003) patch, but I still kept the patch that
recursively frees nodes separate, since we might need more discussion
on it. In the attached v23, patches 0006 through 0016 are fixes and
improvements for the radix tree. I've incorporated all the comments I
got, unless I'm missing something.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v23-0016-Add-read-write-lock-to-radix-tree-in-RT_SHMEM-ca.patch
From 730cdcba6c89954806ac40e2ed63720a93d3fe56 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 17:43:29 +0900
Subject: [PATCH v23 16/18] Add read-write lock to radix tree in RT_SHMEM case.
---
src/include/lib/radixtree.h | 100 +++++++++++++++++-
.../modules/test_radixtree/test_radixtree.c | 8 +-
2 files changed, 99 insertions(+), 9 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 11716fbfca..542daae6d0 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -40,6 +40,8 @@
* There are some optimizations not yet implemented, particularly path
* compression and lazy path expansion.
*
+ * WIP: describe about how locking works.
+ *
* WIP: the radix tree nodes don't shrink.
*
* To generate a radix tree and associated functions for a use case several
@@ -224,7 +226,7 @@ typedef dsa_pointer RT_HANDLE;
#endif
#ifdef RT_SHMEM
-RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
@@ -371,6 +373,16 @@ typedef struct RT_NODE
#define RT_INVALID_PTR_ALLOC NULL
#endif
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
/*
* Inner nodes and leaf nodes have analogous structure. To distinguish
* them at runtime, we take advantage of the fact that the key chunk
@@ -596,6 +608,7 @@ typedef struct RT_RADIX_TREE_CONTROL
#ifdef RT_SHMEM
RT_HANDLE handle;
uint32 magic;
+ LWLock lock;
#endif
RT_PTR_ALLOC root;
@@ -1376,7 +1389,7 @@ RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC store
*/
RT_SCOPE RT_RADIX_TREE *
#ifdef RT_SHMEM
-RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
#else
RT_CREATE(MemoryContext ctx)
#endif
@@ -1398,6 +1411,7 @@ RT_CREATE(MemoryContext ctx)
tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
tree->ctl->handle = dp;
tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
#else
tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
@@ -1581,8 +1595,13 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
+ RT_LOCK_EXCLUSIVE(tree);
+
if (unlikely(tree->ctl->iter_active))
+ {
+ RT_UNLOCK(tree);
elog(ERROR, "cannot add new key-value to radix tree while iteration is in progress");
+ }
/* Empty tree, create the root */
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
@@ -1609,6 +1628,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
{
RT_SET_EXTEND(tree, key, value, parent, stored_child, child);
+ RT_UNLOCK(tree);
return false;
}
@@ -1623,12 +1643,13 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
if (!updated)
tree->ctl->num_keys++;
+ RT_UNLOCK(tree);
return updated;
}
/*
* Search the given key in the radix tree. Return true if there is the key,
- * otherwise return false. On success, we set the value to *val_p so it must
+ * otherwise return false. On success, we set the value to *val_p so it must
* not be NULL.
*/
RT_SCOPE bool
@@ -1636,14 +1657,20 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
{
RT_PTR_LOCAL node;
int shift;
+ bool found;
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
Assert(value_p != NULL);
+ RT_LOCK_SHARED(tree);
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
return false;
+ }
node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
shift = node->shift;
@@ -1657,13 +1684,19 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
break;
if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
return false;
+ }
node = RT_PTR_GET_LOCAL(tree, child);
shift -= RT_NODE_SPAN;
}
- return RT_NODE_SEARCH_LEAF(node, key, value_p);
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
}
#ifdef RT_USE_DELETE
@@ -1685,11 +1718,19 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
+ RT_LOCK_EXCLUSIVE(tree);
+
if (unlikely(tree->ctl->iter_active))
+ {
+ RT_UNLOCK(tree);
elog(ERROR, "cannot delete key to radix tree while iteration is in progress");
+ }
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
return false;
+ }
/*
* Descend the tree to search the key while building a stack of nodes we
@@ -1708,7 +1749,10 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
node = RT_PTR_GET_LOCAL(tree, allocnode);
if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
return false;
+ }
allocnode = child;
shift -= RT_NODE_SPAN;
@@ -1721,6 +1765,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
if (!deleted)
{
/* no key is found in the leaf node */
+ RT_UNLOCK(tree);
return false;
}
@@ -1732,7 +1777,10 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
* node.
*/
if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
return true;
+ }
/* Free the empty leaf node */
RT_FREE_NODE(tree, allocnode);
@@ -1754,6 +1802,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
RT_FREE_NODE(tree, allocnode);
}
+ RT_UNLOCK(tree);
return true;
}
#endif
@@ -1827,8 +1876,13 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
RT_PTR_LOCAL root;
int top_level;
+ RT_LOCK_EXCLUSIVE(tree);
+
if (unlikely(tree->ctl->iter_active))
+ {
+ RT_UNLOCK(tree);
elog(ERROR, "cannot begin iteration while another iteration is in progress");
+ }
old_ctx = MemoryContextSwitchTo(tree->context);
@@ -1838,7 +1892,10 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
/* empty tree */
if (!iter->tree->ctl->root)
+ {
+ RT_UNLOCK(tree);
return iter;
+ }
root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
top_level = root->shift / RT_NODE_SPAN;
@@ -1852,11 +1909,12 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
MemoryContextSwitchTo(old_ctx);
+ RT_UNLOCK(tree);
return iter;
}
/*
- * Return true with setting key_p and value_p if there is next key. Otherwise,
+ * Return true with setting key_p and value_p if there is next key. Otherwise,
* return false.
*/
RT_SCOPE bool
@@ -1864,9 +1922,14 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
Assert(iter->tree->ctl->iter_active);
+ RT_LOCK_SHARED(iter->tree);
+
/* Empty tree */
if (!iter->tree->ctl->root)
+ {
+ RT_UNLOCK(iter->tree);
return false;
+ }
for (;;)
{
@@ -1882,6 +1945,7 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
*key_p = iter->key;
*value_p = value;
+ RT_UNLOCK(iter->tree);
return true;
}
@@ -1899,7 +1963,10 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
/* the iteration finished */
if (!child)
+ {
+ RT_UNLOCK(iter->tree);
return false;
+ }
/*
* Set the node to the node iterator and update the iterator stack
@@ -1910,13 +1977,17 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
/* Node iterators are updated, so try again from the leaf */
}
+ RT_UNLOCK(iter->tree);
return false;
}
RT_SCOPE void
RT_END_ITERATE(RT_ITER *iter)
{
+ RT_LOCK_EXCLUSIVE(iter->tree);
iter->tree->ctl->iter_active = false;
+ RT_UNLOCK(iter->tree);
+
pfree(iter);
}
@@ -1928,6 +1999,8 @@ RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
Size total = 0;
+ RT_LOCK_SHARED(tree);
+
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
total = dsa_get_total_size(tree->dsa);
@@ -1939,6 +2012,7 @@ RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
}
#endif
+ RT_UNLOCK(tree);
return total;
}
@@ -2023,6 +2097,8 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
RT_SCOPE void
RT_STATS(RT_RADIX_TREE *tree)
{
+ RT_LOCK_SHARED(tree);
+
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
@@ -2042,6 +2118,8 @@ RT_STATS(RT_RADIX_TREE *tree)
tree->ctl->cnt[RT_CLASS_125],
tree->ctl->cnt[RT_CLASS_256]);
}
+
+ RT_UNLOCK(tree);
}
static void
@@ -2235,14 +2313,18 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
RT_STATS(tree);
+ RT_LOCK_SHARED(tree);
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
{
+ RT_UNLOCK(tree);
fprintf(stderr, "empty tree\n");
return;
}
if (key > tree->ctl->max_val)
{
+ RT_UNLOCK(tree);
fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
key, key);
return;
@@ -2276,6 +2358,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
shift -= RT_NODE_SPAN;
level++;
}
+ RT_UNLOCK(tree);
fprintf(stderr, "%s", buf.data);
}
@@ -2287,8 +2370,11 @@ RT_DUMP(RT_RADIX_TREE *tree)
RT_STATS(tree);
+ RT_LOCK_SHARED(tree);
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
{
+ RT_UNLOCK(tree);
fprintf(stderr, "empty tree\n");
return;
}
@@ -2296,6 +2382,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
initStringInfo(&buf);
RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
fprintf(stderr, "%s",buf.data);
}
@@ -2323,6 +2410,9 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_GET_KEY_CHUNK
#undef BM_IDX
#undef BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
#undef RT_NODE_IS_LEAF
#undef RT_NODE_MUST_GROW
#undef RT_NODE_KIND_COUNT
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 2a93e731ae..bbe1a619b6 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -144,7 +144,7 @@ test_empty(void)
dsa_area *dsa;
dsa = dsa_create(tranche_id);
- radixtree = rt_create(CurrentMemoryContext, dsa);
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
#else
radixtree = rt_create(CurrentMemoryContext);
#endif
@@ -195,7 +195,7 @@ test_basic(int children, bool test_inner)
test_inner ? "inner" : "leaf", children);
#ifdef RT_SHMEM
- radixtree = rt_create(CurrentMemoryContext, dsa);
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
#else
radixtree = rt_create(CurrentMemoryContext);
#endif
@@ -363,7 +363,7 @@ test_node_types(uint8 shift)
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
#ifdef RT_SHMEM
- radixtree = rt_create(CurrentMemoryContext, dsa);
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
#else
radixtree = rt_create(CurrentMemoryContext);
#endif
@@ -434,7 +434,7 @@ test_pattern(const test_spec * spec)
MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
#ifdef RT_SHMEM
- radixtree = rt_create(radixtree_ctx, dsa);
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
#else
radixtree = rt_create(radixtree_ctx);
#endif
--
2.31.1
v23-0014-Improve-RT_DUMP-and-RT_DUMP_SEARCH-output.patch
From d13da75dfe46d9ea7776751134fe4c22f83cd15d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 16:52:41 +0900
Subject: [PATCH v23 14/18] Improve RT_DUMP() and RT_DUMP_SEARCH() output.
We don't display values since these might not be integers.
---
src/include/lib/radixtree.h | 201 +++++++++++++++++++++---------------
1 file changed, 118 insertions(+), 83 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index dbf9df604f..11716fbfca 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -2023,32 +2023,46 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
RT_SCOPE void
RT_STATS(RT_RADIX_TREE *tree)
{
- ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
- tree->ctl->num_keys,
- tree->ctl->root->shift / RT_NODE_SPAN,
- tree->ctl->cnt[RT_CLASS_3],
- tree->ctl->cnt[RT_CLASS_32_MIN],
- tree->ctl->cnt[RT_CLASS_32_MAX],
- tree->ctl->cnt[RT_CLASS_125],
- tree->ctl->cnt[RT_CLASS_256])));
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
}
-/* XXX For display, assumes value type is numeric */
static void
-RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
{
- char space[125] = {0};
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
- fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
- RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
- (node->kind == RT_NODE_KIND_3) ? 3 :
- (node->kind == RT_NODE_KIND_32) ? 32 :
- (node->kind == RT_NODE_KIND_125) ? 125 : 256,
- node->fanout == 0 ? 256 : node->fanout,
- node->count, node->shift);
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
- if (level > 0)
- sprintf(space, "%*c", level * 4, ' ');
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
switch (node->kind)
{
@@ -2060,20 +2074,24 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
- fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
- space, n3->base.chunks[i], (uint64) n3->values[i]);
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
}
else
{
RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
- fprintf(stderr, "%schunk 0x%X ->",
- space, n3->base.chunks[i]);
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
if (recurse)
- RT_DUMP_NODE(n3->children[i], level + 1, recurse);
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
else
- fprintf(stderr, "\n");
+ appendStringInfo(buf, " (skipped)\n");
}
}
break;
@@ -2086,22 +2104,25 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
{
RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
- fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
- space, n32->base.chunks[i], (uint64) n32->values[i]);
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
}
else
{
RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
- fprintf(stderr, "%schunk 0x%X ->",
- space, n32->base.chunks[i]);
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
if (recurse)
{
- RT_DUMP_NODE(n32->children[i], level + 1, recurse);
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
}
else
- fprintf(stderr, "\n");
+ appendStringInfo(buf, " (skipped)\n");
+
}
}
break;
@@ -2109,26 +2130,23 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
case RT_NODE_KIND_125:
{
RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
- fprintf(stderr, "slot_idxs ");
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
continue;
- fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
}
- if (RT_NODE_IS_LEAF(node))
- {
- RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
- fprintf(stderr, ", isset-bitmap:");
- for (int i = 0; i < BM_IDX(RT_SLOT_IDX_LIMIT); i++)
- {
- fprintf(stderr, RT_UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
- }
- fprintf(stderr, "\n");
- }
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
@@ -2136,30 +2154,39 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
continue;
if (RT_NODE_IS_LEAF(node))
- {
- RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
-
- fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
- space, i, (uint64) RT_NODE_LEAF_125_GET_VALUE(n125, i));
- }
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
else
{
RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
- fprintf(stderr, "%schunk 0x%X ->",
- space, i);
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
if (recurse)
- RT_DUMP_NODE(RT_NODE_INNER_125_GET_CHILD(n125, i),
- level + 1, recurse);
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
else
- fprintf(stderr, "\n");
+ appendStringInfo(buf, " (skipped)\n");
}
}
break;
}
case RT_NODE_KIND_256:
{
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
{
if (RT_NODE_IS_LEAF(node))
@@ -2169,8 +2196,8 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
continue;
- fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
- space, i, (uint64) RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
}
else
{
@@ -2179,14 +2206,17 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
continue;
- fprintf(stderr, "%schunk 0x%X ->",
- space, i);
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
if (recurse)
- RT_DUMP_NODE(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
- recurse);
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
else
- fprintf(stderr, "\n");
+ appendStringInfo(buf, " (skipped)\n");
}
}
break;
@@ -2197,38 +2227,40 @@ RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
RT_SCOPE void
RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
{
+ RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL node;
+ StringInfoData buf;
int shift;
int level = 0;
- elog(NOTICE, "-----------------------------------------------------------");
- elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ")",
- tree->ctl->max_val, tree->ctl->max_val);
+ RT_STATS(tree);
- if (!tree->ctl->root)
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
{
- elog(NOTICE, "tree is empty");
+ fprintf(stderr, "empty tree\n");
return;
}
if (key > tree->ctl->max_val)
{
- elog(NOTICE, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val",
- key, key);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
return;
}
- node = tree->ctl->root;
- shift = tree->ctl->root->shift;
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
while (shift >= 0)
{
- RT_PTR_LOCAL child;
+ RT_PTR_ALLOC child;
- RT_DUMP_NODE(node, level, false);
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
if (RT_NODE_IS_LEAF(node))
{
- uint64 dummy;
+ RT_VALUE_TYPE dummy;
/* We reached at a leaf node, find the corresponding slot */
RT_NODE_SEARCH_LEAF(node, key, &dummy);
@@ -2239,30 +2271,33 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
if (!RT_NODE_SEARCH_INNER(node, key, &child))
break;
- node = child;
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
shift -= RT_NODE_SPAN;
level++;
}
+
+ fprintf(stderr, "%s", buf.data);
}
RT_SCOPE void
RT_DUMP(RT_RADIX_TREE *tree)
{
+ StringInfoData buf;
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\n",
- RT_SIZE_CLASS_INFO[i].name,
- RT_SIZE_CLASS_INFO[i].inner_size,
- RT_SIZE_CLASS_INFO[i].leaf_size);
- fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ RT_STATS(tree);
- if (!tree->ctl->root)
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
{
fprintf(stderr, "empty tree\n");
return;
}
- RT_DUMP_NODE(tree->ctl->root, 0, true);
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+
+ fprintf(stderr, "%s",buf.data);
}
#endif
--
2.31.1
v23-0018-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
From 0822ccf1c1df26abf50e865c62a69a302fcfc58f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 17 Jan 2023 17:20:37 +0700
Subject: [PATCH v23 18/18] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space efficient and was slow to look up. Also, it
had a 1GB limit on its size.
Now we use TIDStore to store dead tuple TIDs. Since the TIDStore,
backed by the radix tree, incrementally allocates memory, we get rid
of the 1GB limit.
Since we are no longer able to exactly estimate the maximum number of
TIDs that can be stored, pg_stat_progress_vacuum now shows the progress
information based on the amount of memory in bytes. The column names
are also changed to max_dead_tuple_bytes and num_dead_tuple_bytes.
In addition, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by TIDStore is 1MB, the initial DSA
segment size. Due to that, we increase the minimum value of
maintenance_work_mem (and autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 218 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-------
src/backend/commands/vacuumparallel.c | 62 +++---
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 142 insertions(+), 278 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d936aa3da3..0230c74e3d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6870,10 +6870,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6881,10 +6881,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..3537df16fd 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,11 +221,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -259,8 +263,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +830,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +911,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1018,7 +1022,7 @@ lazy_scan_heap(LVRelState *vacrel)
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1038,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1080,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1156,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1204,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1260,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1543,13 +1554,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1580,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1588,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->deadoffsets; prunestate->deadoffsets's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1601,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1646,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1875,7 +1883,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1896,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1917,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -2129,8 +2118,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2127,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2179,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2208,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2235,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2281,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2354,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2410,10 +2391,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2410,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2420,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2434,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2445,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,36 +2455,30 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2497,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2571,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3093,46 +3066,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3143,11 +3076,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3105,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3118,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..a526e607fe 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127e..d8e680ca20 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,82 +2342,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..5c7e6ed99c 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index f5ea381c53..d88db3e1f8 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3397,12 +3397,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4ac808ed22..422914f0a9 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2312,7 +2312,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..a3ebb169ef 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e7a2f5856a..f6ae02eb14 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
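As a quick illustration of how the change above is meant to be driven
from the heap-scan side, here is a rough sketch (not part of any
attached patch). Only the tidstore_* calls, LVRelState, lazy_vacuum()
and MaxHeapTuplesPerPage exist in the tree or in the patches;
collect_dead_offsets() and the loop framing are hypothetical
placeholders.

/* Sketch only: how lazy vacuum could feed the TidStore from the heap scan */
static void
lazy_scan_heap_sketch(LVRelState *vacrel, BlockNumber nblocks)
{
    for (BlockNumber blkno = 0; blkno < nblocks; blkno++)
    {
        OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
        int         ndead;

        /* hypothetical helper returning this page's dead item offsets */
        ndead = collect_dead_offsets(vacrel, blkno, deadoffsets);

        if (ndead > 0)
            tidstore_add_tids(vacrel->dead_items, blkno, deadoffsets, ndead);

        /*
         * The memory-based limit replaces the old max_items counter: once
         * the store is full, vacuum the indexes and the heap, then start
         * over with an empty store.
         */
        if (tidstore_is_full(vacrel->dead_items))
        {
            lazy_vacuum(vacrel);
            tidstore_reset(vacrel->dead_items);
        }
    }
}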
Attachment: v23-0017-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (application/octet-stream)
From 32ccdca354e5d9e82f8be512e3afc65ee9930f2a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v23 17/18] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by a radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 674 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 195 +++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1019 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1756f1a4b6..d936aa3da3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2192,6 +2192,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..89aea71945
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,674 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * with tidstore_attach().
+ *
+ * XXX: Only one process is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of a 64-bit key and
+ * a 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is specified by max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys could help keep the radix
+ * tree shallow.
+ *
+ * For example, a tid in a heap with 8kB blocks uses the lowest 9 bits for
+ * the offset number and the next 32 bits for the block number. That is,
+ * only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ *
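+ * As a concrete example (numbers chosen only for illustration): with 9
+ * offset bits, the tid (blkno = 1000, off = 5) becomes
+ * tid_i = (1000 << 9) | 5 = 512005, which is stored as key = 512005 >> 6
+ * = 8000 with bit 5 set in the 64-bit value.
+ *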
+ * If the number of bits needed for the offset number fits in the 64-bit
+ * value, we don't encode tids; the block number is used directly as the key
+ * and the offset number as the bit position in the value.
+ */
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ int64 num_tids; /* the number of Tids stored so far */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ bool encode_tids; /* do we use tid encoding? */
+ int offset_nbits; /* the number of bits used for offset number */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used for the key */
+
+ /* The fields below are used only in the shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends not only on the number of Tids stored but also
+ * on their distribution, on how the radix tree stores them, and on the memory
+ * management backing the radix tree. The maximum number of bytes a TidStore
+ * may use is specified by max_bytes in tidstore_create(), and we want the
+ * total memory consumption not to exceed it.
+ *
+ * In the non-shared case, the radix tree uses a slab allocator for each node
+ * class. The most memory-consuming case while adding Tids associated with
+ * one page (i.e. during tidstore_add_tids()) is allocating the largest radix
+ * tree node in a new slab block, which is approximately 70kB. Therefore, we
+ * deduct 70kB from the maximum bytes.
+ *
+ * In the shared case, DSA allocates memory segments following a geometric
+ * series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA grows its segments; the
+ * simulation showed that a 75% threshold of the maximum bytes works well when
+ * it is a power of two, and a 60% threshold works for other cases.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (1024 * 70);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ if (ts->control->offset_nbits > TIDSTORE_VALUE_NBITS)
+ {
+ ts->control->encode_tids = true;
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+ }
+ else
+ {
+ ts->control->encode_tids = false;
+ ts->control->offset_key_nbits = 0;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this TidStore.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, val);
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ ItemPointerData tid;
+ uint64 key_base;
+ uint64 *values;
+ int nkeys;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ if (ts->control->encode_tids)
+ {
+ key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+ }
+ else
+ {
+ key_base = (uint64) blkno;
+ nkeys = 1;
+ }
+
+ values = palloc0(sizeof(uint64) * nkeys);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 key;
+ uint32 off;
+ int idx;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ /* encode the tid to key and val */
+ key = tid_to_key_off(ts, &tid, &off);
+
+ idx = key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ values[idx] |= UINT64CONST(1) << off;
+ }
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i < nkeys; i++)
+ {
+ if (values[i])
+ {
+ uint64 key = key_base + i;
+
+ tidstore_insert_kv(ts, key, values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to do */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ int64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /* The lock is initialized only in the shared case */
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if (i > iter->ts->control->max_offset)
+ break;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ if (ts->control->encode_tids)
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+ else
+ return (BlockNumber) key;
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+{
+ uint64 key;
+ uint64 tid_i;
+
+ if (!ts->control->encode_tids)
+ {
+ *off = ItemPointerGetOffsetNumber(tid);
+
+ /* Use the block number as the key */
+ return (uint64) ItemPointerGetBlockNumber(tid);
+ }
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << ts->control->offset_nbits;
+
+ *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter *tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9b849ae8e8
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,195 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
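For anyone who wants to try the new API without going through the test
module, a minimal backend-local usage sketch looks roughly like the
following. The numbers and the function itself are illustrative only;
the tidstore_* calls and macros used here are the ones declared in the
patch and in existing headers.

#include "postgres.h"

#include "access/htup_details.h"
#include "access/tidstore.h"

static void
tidstore_usage_sketch(void)
{
    TidStore   *ts;
    TidStoreIter *iter;
    TidStoreIterResult *res;
    OffsetNumber offs[] = {1, 2, 50};
    ItemPointerData tid;

    /* up to 1MB of TID storage, backend-local since no DSA area is given */
    ts = tidstore_create(1024 * 1024, MaxHeapTuplesPerPage, NULL);

    /* add the same three offsets on two different blocks */
    tidstore_add_tids(ts, 10, offs, lengthof(offs));
    tidstore_add_tids(ts, 20, offs, lengthof(offs));

    ItemPointerSet(&tid, 10, 2);
    if (!tidstore_lookup_tid(ts, &tid))
        elog(ERROR, "(10,2) should be present");

    /* iterate block by block; offsets come back sorted */
    iter = tidstore_begin_iterate(ts);
    while ((res = tidstore_iterate_next(iter)) != NULL)
        elog(DEBUG1, "block %u has %d dead offsets",
             res->blkno, res->num_offsets);
    tidstore_end_iterate(iter);

    Assert(tidstore_num_tids(ts) == 6);

    tidstore_destroy(ts);
}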
Attachment: v23-0015-Detach-DSA-after-tests-in-test_radixtree.patch (application/octet-stream)
From 139100053c485f7ade6117e42ab6567dd94bdd76 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 17:43:04 +0900
Subject: [PATCH v23 15/18] Detach DSA after tests in test_radixtree.
---
src/test/modules/test_radixtree/test_radixtree.c | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 64d46dfe9a..2a93e731ae 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -172,6 +172,10 @@ test_empty(void)
rt_end_iterate(iter);
rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
}
static void
@@ -243,6 +247,9 @@ test_basic(int children, bool test_inner)
pfree(keys);
rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
}
/*
@@ -371,6 +378,9 @@ test_node_types(uint8 shift)
test_node_types_insert(radixtree, shift, false);
rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
}
/*
@@ -636,6 +646,9 @@ test_pattern(const test_spec * spec)
rt_free(radixtree);
MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
}
Datum
--
2.31.1
Attachment: v23-0013-Remove-XXX-comment-for-MemoryContext-support-for.patch (application/octet-stream)
From 8e5ca0c31972bde4e0d64b76ef8cfee599af1044 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 11:07:38 +0900
Subject: [PATCH v23 13/18] Remove XXX comment for MemoryContext support for
RT_ATTACH() as discussed.
---
src/include/lib/radixtree.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index e9ff3aa05d..dbf9df604f 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1433,7 +1433,6 @@ RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
RT_RADIX_TREE *tree;
dsa_pointer control;
- /* XXX: memory context support */
tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
/* Find the control object in shard memory */
--
2.31.1
Attachment: v23-0011-Add-a-safeguard-for-concurrent-iteration-in-RT_S.patch (application/octet-stream)
From 56c458643f58723c59ed28477f6d129374a59e6c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 11:04:41 +0900
Subject: [PATCH v23 11/18] Add a safeguard for concurrent iteration in
RT_SHMEM case.
---
src/include/lib/radixtree.h | 21 +++++++++++++++++----
1 file changed, 17 insertions(+), 4 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 003e8215aa..0277d5e6fb 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -602,6 +602,9 @@ typedef struct RT_RADIX_TREE_CONTROL
uint64 max_val;
uint64 num_keys;
+ /* is iteration in progress? */
+ bool iter_active;
+
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
@@ -638,10 +641,7 @@ typedef struct RT_RADIX_TREE
* advancing the current index within the node or when moving to the next node
* at the same level.
*
- * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
- * has the local pointers to nodes, rather than RT_PTR_ALLOC.
- * We need either a safeguard to disallow other processes to begin the iteration
- * while one process is doing or to allow multiple processes to do the iteration.
+ * In RT_SHMEM case, only one process is allowed to do iteration.
*/
typedef struct RT_NODE_ITER
{
@@ -1582,6 +1582,9 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
+ if (unlikely(tree->ctl->iter_active))
+ elog(ERROR, "cannot add new key-value to radix tree while iteration is in progress");
+
/* Empty tree, create the root */
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
RT_NEW_ROOT(tree, key);
@@ -1683,6 +1686,9 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
+ if (unlikely(tree->ctl->iter_active))
+ elog(ERROR, "cannot delete key to radix tree while iteration is in progress");
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
return false;
@@ -1822,10 +1828,14 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
RT_PTR_LOCAL root;
int top_level;
+ if (unlikely(tree->ctl->iter_active))
+ elog(ERROR, "cannot begin iteration while another iteration is in progress");
+
old_ctx = MemoryContextSwitchTo(tree->context);
iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
iter->tree = tree;
+ tree->ctl->iter_active = true;
/* empty tree */
if (!iter->tree->ctl->root)
@@ -1853,6 +1863,8 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
RT_SCOPE bool
RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
+ Assert(iter->tree->ctl->iter_active);
+
/* Empty tree */
if (!iter->tree->ctl->root)
return false;
@@ -1905,6 +1917,7 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
RT_SCOPE void
RT_END_ITERATE(RT_ITER *iter)
{
+ iter->tree->ctl->iter_active = false;
pfree(iter);
}
--
2.31.1
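To make the new safeguard concrete, here is a sketch of how it would
trip for a local (non-shared) tree. The demo_rt prefix and the function
are hypothetical, but the template parameters mirror the instantiation
in tidstore.c and the error message is the one added by this patch.

#include "postgres.h"

#define RT_PREFIX demo_rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE uint64
#include "lib/radixtree.h"

static void
demo_iteration_guard(void)
{
    demo_rt_radix_tree *tree = demo_rt_create(CurrentMemoryContext);
    demo_rt_iter *iter;

    demo_rt_set(tree, 42, 1);

    iter = demo_rt_begin_iterate(tree);

    /*
     * With iter_active now set, this call is expected to fail with
     * "cannot add new key-value to radix tree while iteration is in
     * progress" instead of silently disturbing the ongoing iteration.
     */
    demo_rt_set(tree, 43, 1);

    /* not reached; normally we would iterate here and then clean up */
    demo_rt_end_iterate(iter);
    demo_rt_free(tree);
}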
Attachment: v23-0010-Fix-a-typo-in-simd.h.patch (application/octet-stream)
From d8b39122cea6ca7363b0ae6d96d99bd018a264c4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 10:51:12 +0900
Subject: [PATCH v23 10/18] Fix a typo in simd.h
---
src/include/port/simd.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 84d41a340a..f0bba33c53 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -280,7 +280,7 @@ vector8_is_highbit_set(const Vector8 v)
}
/*
- * Return the bitmak of the high-bit of each element.
+ * Return the bitmask of the high-bit of each element.
*/
static inline uint32
vector8_highbit_mask(const Vector8 v)
--
2.31.1
Attachment: v23-0009-Miscellaneous-fixes.patch (application/octet-stream)
From 222e13f6e19baa6189c25167d0f20919230842c3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 10:50:33 +0900
Subject: [PATCH v23 09/18] Miscellaneous fixes.
---
src/include/lib/radixtree.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index b389ee3ed3..003e8215aa 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -304,7 +304,7 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
* XXX There are 4 node kinds, and this should never be increased,
* for several reasons:
* 1. With 5 or more kinds, gcc tends to use a jump table for switch
- * statments.
+ * statements.
* 2. The 4 kinds can be represented with 2 bits, so we have the option
* in the future to tag the node pointer with the kind, even on
* platforms with 32-bit pointers. This might speed up node traversal
@@ -2239,7 +2239,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
{
for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\t%zu\n",
+ fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\n",
RT_SIZE_CLASS_INFO[i].name,
RT_SIZE_CLASS_INFO[i].inner_size,
RT_SIZE_CLASS_INFO[i].leaf_size);
--
2.31.1
v23-0012-Don-t-include-the-size-of-RT_RADIX_TREE-to-memor.patch
From b90e3412b94bfc5bf8de7e2f1e6a0fe286075f52 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 11:06:30 +0900
Subject: [PATCH v23 12/18] Don't include the size of RT_RADIX_TREE to memory
usage as discussed.
---
src/include/lib/radixtree.h | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 0277d5e6fb..e9ff3aa05d 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1927,8 +1927,7 @@ RT_END_ITERATE(RT_ITER *iter)
RT_SCOPE uint64
RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
- // XXX is this necessary?
- Size total = sizeof(RT_RADIX_TREE);
+ Size total = 0;
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
--
2.31.1
v23-0008-Align-indents-of-the-file-header-comments.patch
From 6c08547c8d6b56ff7ff4a686cab863d58c6a16e6 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 10:49:17 +0900
Subject: [PATCH v23 08/18] Align indents of the file header comments.
---
src/include/lib/radixtree.h | 36 ++++++++++++++++++------------------
1 file changed, 18 insertions(+), 18 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 6852cb0b45..b389ee3ed3 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -42,25 +42,25 @@
*
* WIP: the radix tree nodes don't shrink.
*
- * To generate a radix tree and associated functions for a use case several
- * macros have to be #define'ed before this file is included. Including
- * the file #undef's all those, so a new radix tree can be generated
- * afterwards.
- * The relevant parameters are:
- * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
- * will result in radix tree type 'foo_radix_tree' and functions like
- * 'foo_create'/'foo_free' and so forth.
- * - RT_DECLARE - if defined function prototypes and type declarations are
- * generated
- * - RT_DEFINE - if defined function definitions are generated
- * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
- * declarations reside
- * - RT_VALUE_TYPE - the type of the value.
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
*
- * Optional parameters:
- * - RT_SHMEM - if defined, the radix tree is created in the DSA area
- * so that multiple processes can access it simultaneously.
- * - RT_DEBUG - if defined add stats tracking and debugging functions
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
*
* Interface
* ---------
--
2.31.1
v23-0007-undef-RT_SLOT_IDX_LIMIT.patch
From 31742053ef1824698e0ae0c3a059eb2f06164522 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 10:45:16 +0900
Subject: [PATCH v23 07/18] undef RT_SLOT_IDX_LIMIT.
---
src/include/lib/radixtree.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 7fcd212ea4..6852cb0b45 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -2281,6 +2281,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_NODE_MUST_GROW
#undef RT_NODE_KIND_COUNT
#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
#undef RT_INVALID_SLOT_IDX
#undef RT_SLAB_BLOCK_SIZE
#undef RT_RADIX_TREE_MAGIC
--
2.31.1
v23-0006-Fix-compile-error-when-RT_VALUE_TYPE-is-non-inte.patch
From 00d0b18389d7852b34a3eee16f69038a2f07ebaa Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 10:43:01 +0900
Subject: [PATCH v23 06/18] Fix compile error when RT_VALUE_TYPE is
non-integer.
'value' must be initialized since we assign it
to *value_p, to suppress a compiler warning.
---
src/include/lib/radixtree_search_impl.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index c4352045c8..a319c46c39 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -15,7 +15,8 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- RT_VALUE_TYPE value = 0;
+ RT_VALUE_TYPE value;
+ MemSet(&value, 0, sizeof(RT_VALUE_TYPE));
Assert(RT_NODE_IS_LEAF(node));
#else
--
2.31.1
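(Not part of the patch: for context, a hypothetical instantiation of the kind this fix targets. With a struct value type, the previous initialization "RT_VALUE_TYPE value = 0" does not compile, while MemSet works for any value type. All names below are made up for illustration.)

    /* hypothetical value type that is not an integer */
    typedef struct BlockBitmap
    {
        uint64      words[4];
    } BlockBitmap;

    #define RT_PREFIX bbrt
    #define RT_SCOPE static
    #define RT_DECLARE
    #define RT_DEFINE
    #define RT_VALUE_TYPE BlockBitmap
    #include "lib/radixtree.h"

    /* bbrt_set()/bbrt_search() now copy whole BlockBitmap values */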
v23-0005-Tool-for-measuring-radix-tree-performance.patch
From 5157516f81a3f19de42809fbaec6f3b1e523c68a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v23 05/18] Tool for measuring radix tree performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 ++
contrib/bench_radix_tree/bench_radix_tree.c | 656 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 822 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..4c785c7336
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,656 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
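(Not part of the patch: a worked example of the key encoding used by tid_to_key_off() in bench_radix_tree.c, assuming 8kB pages so that MaxHeapTuplesPerPage is 291 and pg_ceil_log2_32() yields a shift of 9.)

    /*
     * tid   = (block 10, offset 5)
     * tid_i = 5 | (10 << 9)  = 5125
     * off   = 5125 & 0x3F    = 5    -> bit position within the value
     * key   = 5125 >> 6      = 80   -> radix tree key
     *
     * So each radix tree entry holds a 64-bit bitmap covering 64 consecutive
     * (block, offset) slots, and the benchmark loads this TID as:
     */
    rt_set(rt, UINT64CONST(80), UINT64CONST(1) << 5);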
v23-0004-Free-all-radix-tree-nodes-recursively.patch
From 9df198bc8781a4d619e4d8c4e584305ef560be48 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 20 Jan 2023 12:38:54 +0700
Subject: [PATCH v23 04/18] Free all radix tree nodes recursively
TODO: Consider adding more general functionality to DSA
to free all segments.
---
src/include/lib/radixtree.h | 78 +++++++++++++++++++++++++++++++++++++
1 file changed, 78 insertions(+)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index bc0c0b5853..7fcd212ea4 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -139,6 +139,7 @@
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND RT_MAKE_NAME(extend)
#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
@@ -1458,6 +1459,78 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
return tree->ctl->handle;
}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
#endif
/*
@@ -1469,6 +1542,10 @@ RT_FREE(RT_RADIX_TREE *tree)
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
/*
* Vandalize the control block to help catch programming error where
* other backends access the memory formerly occupied by this radix tree.
@@ -2268,6 +2345,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
#undef RT_EXTEND
#undef RT_SET_EXTEND
#undef RT_SWITCH_NODE_KIND
--
2.31.1
v23-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From beaaee64bc91286d05b9e3c47e9f42eeb2ff5f19 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v23 02/18] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 51484ca7e2..077f197a64 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
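(Not part of the patch: a quick numeric check of the relocated rightmost-one helpers.)

    uint32      w = 0x34;                       /* binary 0110100 */
    uint32      lsb32 = pg_rightmost_one32(w);  /* 0x04, the lowest set bit */
    uint64      lsb64 = pg_rightmost_one64(UINT64CONST(0xF000));   /* 0x1000 */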
v23-0003-Add-radixtree-template.patch
From 0dfc3627858a18821ac12e9a0f84c922194f3ac7 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v23 03/18] Add radixtree template
The only thing configurable in this commit is function scope,
prefix, and local/shared memory.
The key and value type are still hard-coded to uint64.
(A later commit in v21 will make value type configurable)
It might be good at some point to offer a different tree type,
e.g. "single-value leaves" to allow for variable length keys
and values, giving full flexibility to developers.
TODO: Much broader commit message
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2314 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 106 +
src/include/lib/radixtree_insert_impl.h | 317 +++
src/include/lib/radixtree_iter_impl.h | 138 +
src/include/lib/radixtree_search_impl.h | 131 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 660 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 3817 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 604b702a91..50f0aae3ab 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..bc0c0b5853
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2314 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves"
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE val);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256 is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statments.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store the
+ * pointer to a child node in the slot. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base type of each node kind for leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses a slot_idxs array, an array of RT_NODE_MAX_SLOTS length,
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different than
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slot is in use.
+ */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each key-value pair in ascending key
+ * order. To support this, we iterate over the nodes at each level.
+ *
+ * The RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration over the radix tree, and uses
+ * RT_NODE_ITER to track the iteration at each level. During iteration, we
+ * also construct the key whenever updating the node iteration information,
+ * e.g., when advancing the current index within a node or when moving to the
+ * next node at the same level.
+ *
+ * XXX: Currently we allow only one process to iterate. Therefore, RT_NODE_ITER
+ * holds local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard that disallows other processes from beginning an
+ * iteration while one is in progress, or support for multiple processes
+ * iterating concurrently.
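+ *
+ * For example, in a tree whose root has shift 16 (stack_len = 2, with the
+ * 8-bit span), stack[2] tracks the root: when its iterator yields chunk 0xAB,
+ * RT_ITER_UPDATE_KEY() sets bits 16-23 of iter->key to 0xAB; the node at
+ * shift 8 then sets bits 8-15, and the leaf iterator sets bits 0-7, at which
+ * point iter->key is the complete key returned together with the value.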
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE value);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return the index in the chunk and slot arrays at which to insert into the
+ * node, such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return the index in the chunk and slot arrays at which to insert into the
+ * node, such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+	 * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+	 * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+	 * we need to play some trickery using vector8_min() to effectively get
+	 * <=: the minimum of the search key and an element equals the search key
+	 * exactly where that element is >= the key. There'll never be any equal
+	 * elements in current uses, but that's what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift needed to store the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
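+ *
+ * Worked example for this function and RT_KEY_GET_SHIFT() above, with the
+ * 8-bit span: for key 0x123456 the highest set bit is bit 20, so
+ * RT_KEY_GET_SHIFT() returns 16, and RT_SHIFT_GET_MAX_VAL(16) returns
+ * 0xFFFFFF, i.e. a root at shift 16 can hold any key that fits in three
+ * chunks.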
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+		allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+		node = RT_PTR_GET_LOCAL(tree, allocnode);
+		RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have inner and leaf nodes for the given
+ * key-value pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is copied to *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the child pointer corresponding to 'key' from the given node.
+ *
+ * Return true if the key was found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Delete the value corresponding to 'key' from the given leaf node.
+ *
+ * Return true if the key was found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ /* XXX: memory context support */
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+	/* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, we update its value to 'value'
+ * and return true. Return false if the entry doesn't yet exist.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value, parent, stored_child, child);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, we set the value to *value_p, so value_p
+ * must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
+}
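+
+/*
+ * A minimal usage sketch for the local-memory case, assuming simplehash.h-
+ * style name generation (so that RT_PREFIX "rt" with RT_VALUE_TYPE uint64
+ * produces rt_radix_tree, rt_create(), rt_set(), rt_search(), and so on) and
+ * that the template is included as lib/radixtree.h:
+ *
+ *		#define RT_PREFIX rt
+ *		#define RT_SCOPE static
+ *		#define RT_DECLARE
+ *		#define RT_DEFINE
+ *		#define RT_VALUE_TYPE uint64
+ *		#include "lib/radixtree.h"
+ *
+ *		rt_radix_tree *tree = rt_create(CurrentMemoryContext);
+ *		uint64		value = 42;
+ *
+ *		rt_set(tree, key, value);
+ *		if (rt_search(tree, key, &value))
+ *			elog(NOTICE, "found " UINT64_FORMAT, value);
+ *		rt_free(tree);
+ */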
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (node->count > 0)
+ return true;
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+	/* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ return true;
+}
+#endif
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and the value
+ * is set to *value_p, otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is being
+ * constructed while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise,
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+		 * iterators from level 1 upward until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+		 * Found the next child node. Update the iterator stack from this
+		 * node down to the leaf level.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+ pfree(iter);
+}
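+
+/*
+ * Sketch of the expected iteration pattern, using the same hypothetical "rt"
+ * prefix as in the usage sketch above; keys are returned in ascending order:
+ *
+ *		rt_iter    *iter = rt_begin_iterate(tree);
+ *		uint64		key;
+ *		uint64		value;
+ *
+ *		while (rt_iterate_next(iter, &key, &value))
+ *			do_something(key, value);
+ *		rt_end_iterate(iter);
+ */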
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ // XXX is this necessary?
+ Size total = sizeof(RT_RADIX_TREE);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+					/* Check if the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ ereport(NOTICE, (errmsg("num_keys = " UINT64_FORMAT ", height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u",
+ tree->ctl->num_keys,
+ tree->ctl->root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256])));
+}
+
+/* XXX For display, assumes value type is numeric */
+static void
+RT_DUMP_NODE(RT_PTR_LOCAL node, int level, bool recurse)
+{
+ char space[125] = {0};
+
+ fprintf(stderr, "[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ if (level > 0)
+ sprintf(space, "%*c", level * 4, ' ');
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
+ space, n3->base.chunks[i], (uint64) n3->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n3->base.chunks[i]);
+
+ if (recurse)
+ RT_DUMP_NODE(n3->children[i], level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
+ space, n32->base.chunks[i], (uint64) n32->values[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ RT_DUMP_NODE(n32->children[i], level + 1, recurse);
+ }
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+
+ fprintf(stderr, "slot_idxs ");
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ fprintf(stderr, " [%d]=%d, ", i, b125->slot_idxs[i]);
+ }
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n = (RT_NODE_LEAF_125 *) node;
+
+ fprintf(stderr, ", isset-bitmap:");
+ for (int i = 0; i < BM_IDX(RT_SLOT_IDX_LIMIT); i++)
+ {
+ fprintf(stderr, RT_UINT64_FORMAT_HEX " ", (uint64) n->base.isset[i]);
+ }
+ fprintf(stderr, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_125 *n125 = (RT_NODE_LEAF_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
+ space, i, (uint64) RT_NODE_LEAF_125_GET_VALUE(n125, i));
+ }
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ RT_DUMP_NODE(RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X value 0x" RT_UINT64_FORMAT_HEX "\n",
+ space, i, (uint64) RT_NODE_LEAF_256_GET_VALUE(n256, i));
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ fprintf(stderr, "%schunk 0x%X ->",
+ space, i);
+
+ if (recurse)
+ RT_DUMP_NODE(RT_NODE_INNER_256_GET_CHILD(n256, i), level + 1,
+ recurse);
+ else
+ fprintf(stderr, "\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ int level = 0;
+
+ elog(NOTICE, "-----------------------------------------------------------");
+ elog(NOTICE, "max_val = " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ")",
+ tree->ctl->max_val, tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ elog(NOTICE, "tree is empty");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ elog(NOTICE, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val",
+ key, key);
+ return;
+ }
+
+ node = tree->ctl->root;
+ shift = tree->ctl->root->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_LOCAL child;
+
+ RT_DUMP_NODE(node, level, false);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ uint64 dummy;
+
+			/* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ node = child;
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ fprintf(stderr, "%s\tinner_size %zu\tleaf_size %zu\t%zu\n",
+ RT_SIZE_CLASS_INFO[i].name,
+ RT_SIZE_CLASS_INFO[i].inner_size,
+ RT_SIZE_CLASS_INFO[i].leaf_size);
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+
+ if (!tree->ctl->root)
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ RT_DUMP_NODE(tree->ctl->root, 0, true);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef BM_IDX
+#undef BM_BIT
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..99c90771b9
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,106 @@
+/* TODO: shrink nodes */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..22aca0e6cc
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,317 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[idx] = value;
+#else
+ n3->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = value;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
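
As an aside, the slot-allocation step in the RT_NODE_KIND_125 branch above boils down to the following standalone sketch. This is illustration only, not part of the patch; it assumes a 64-bit bitmapword so that pg_rightmost_one_pos64() from port/pg_bitutils.h applies directly.

#include "postgres.h"
#include "port/pg_bitutils.h"

/*
 * Find the first free (zero) bit in a 64-bit "isset" word: the first 0 bit
 * in X is the first 1 bit in ~X.  The caller must ensure a free slot exists.
 */
static int
first_unset_bit(uint64 isset)
{
	uint64		inverse = ~isset;

	Assert(inverse != 0);
	return pg_rightmost_one_pos64(inverse);
}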
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..823d7107c4
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,138 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..c4352045c8
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,131 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value = 0;
+
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[idx];
+#else
+ child = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[idx];
+#else
+ child = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ *value_p = value;
+#else
+ Assert(child_p != NULL);
+ *child_p = child;
+#endif
+
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 104386e674..c67f936880 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -117,6 +117,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/include/lib/radixtree.h"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
'--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..64d46dfe9a
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,660 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the tests, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], (TestValueType) (keys[i] + 1)))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType) key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType) x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.31.1
Attachment: v23-0001-introduce-vector8_min-and-vector8_highbit_mask.patch (application/octet-stream)
From 990c01fbf68b39b5f2c6109440f63e6c305ba7f0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v23 01/18] introduce vector8_min and vector8_highbit_mask
TODO: commit message
TODO: Remove uint64 case.
separate-commit TODO: move non-SIMD fallbacks to own header
to clean up the #ifdef maze.
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..84d41a340a 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmak of the high-bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
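
To show how the new helper is intended to be used (for example, by the node-32 chunk search in the radix tree patch above), here is a rough usage sketch. It is not part of the patch, assumes USE_NO_SIMD is not defined so that vector8_eq() is available, and the helper name chunk_search_eq is made up for illustration:

#include "postgres.h"
#include "port/pg_bitutils.h"
#include "port/simd.h"

/*
 * Return the index of the first byte in 'chunks' equal to 'key', or -1 if
 * none matches.  'count' must be no larger than sizeof(Vector8), and the
 * array must be at least sizeof(Vector8) bytes long so the load is safe.
 */
static inline int
chunk_search_eq(const uint8 *chunks, uint8 key, int count)
{
	Vector8		spread_chunk = vector8_broadcast(key);
	Vector8		haystack;
	uint32		bitfield;

	vector8_load(&haystack, chunks);

	/* matching bytes become 0xFF; collect their high bits into a bitmask */
	bitfield = vector8_highbit_mask(vector8_eq(haystack, spread_chunk));

	/* ignore lanes beyond the number of valid entries */
	bitfield &= ((uint32) 1 << count) - 1;

	return (bitfield != 0) ? pg_rightmost_one_pos32(bitfield) : -1;
}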
On Thu, Jan 26, 2023 at 3:33 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Thu, Jan 26, 2023 at 3:54 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I think that we need to prevent concurrent updates (RT_SET() and
RT_DELETE()) during the iteration to get the consistent result through
the whole iteration operation. Unlike other operations such as
RT_SET(), we cannot expect that a job doing something for each
key-value pair in the radix tree completes in a short time, so we
cannot keep holding the radix tree lock until the end of the
iteration.
This sounds like a performance concern, rather than a correctness concern,
is that right? If so, I don't think we should worry too much about
optimizing simple locking, because it will *never* be fast enough for
highly-concurrent read-write workloads anyway, and anyone interested in
those workloads will have to completely replace the locking scheme,
possibly using one of the ideas in the last ART paper you mentioned.
The first implementation should be simple, easy to test/verify, easy to
understand, and easy to replace. As much as possible anyway.
So the idea is that we set iter_active to true (with the
lock in exclusive mode), and prevent concurrent updates when the flag
is true.
...by throwing elog(ERROR)? I'm not so sure users of this API would prefer
that to waiting.
Since there were calls to LWLockAcquire/Release in the last version,
I'm a bit confused by this. Perhaps for the next patch, the email should
contain a few sentences describing how locking is intended to work,
including for iteration.
The lock I'm thinking of adding is a simple readers-writer lock. This
lock is used for concurrent radix tree operations except for the
iteration. For operations concurrent to the iteration, I used a flag
for the reason I mentioned above.
This doesn't tell me anything -- we already agreed on "simple reader-writer
lock", months ago I believe. And I only have a vague idea about the
tradeoffs made regarding iteration.
+ * WIP: describe about how locking works.
A first draft of what is intended for this WIP would be a good start. This
WIP is from v23-0016, which contains no comments and a one-line commit
message. I'd rather not try closely studying that patch (or how it works
with 0011) until I have a clearer understanding of what requirements are
assumed, what trade-offs are considered, and how it should be tested.
[thinks some more...] Is there an API-level assumption that hasn't been
spelled out? Would it help to have a parameter for whether the iteration
function wants to reserve the privilege to perform writes? It could take
the appropriate lock at the start, and there could then be multiple
read-only iterators, but only one read/write iterator. Note, I'm just
guessing here, and I don't want to make things more difficult for future
improvements.
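To make that concrete, a minimal sketch of such an API could look like the following; the _ext function names, the opts struct, and the ctl->lock field are hypothetical, not taken from the patch:

#include "postgres.h"
#include "storage/lwlock.h"

/* Hypothetical: the caller states up front whether it will write during iteration. */
typedef struct rt_iter_opts
{
	bool		exclusive;		/* true if the caller intends to modify the tree */
} rt_iter_opts;

static rt_iter *
rt_begin_iterate_ext(rt_radix_tree *tree, const rt_iter_opts *opts)
{
	/*
	 * Shared mode allows many concurrent read-only iterators; exclusive mode
	 * allows a single iterator that may also call rt_set()/rt_delete().
	 */
	LWLockAcquire(&tree->ctl->lock, opts->exclusive ? LW_EXCLUSIVE : LW_SHARED);
	return rt_begin_iterate(tree);
}

static void
rt_end_iterate_ext(rt_radix_tree *tree, rt_iter *iter)
{
	rt_end_iterate(iter);
	LWLockRelease(&tree->ctl->lock);
}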
Hmm, I wonder if we need to use the isolation tester. It's both a
blessing and a curse that the first client of this data structure is tid
lookup. It's a blessing because it doesn't present a highly-concurrent
workload mixing reads and writes and so simple locking is adequate. It's a
curse because to test locking and have any chance of finding bugs, we can't
rely on vacuum to tell us that because (as you've said) it might very well
work fine with no locking at all. So we must come up with test cases
ourselves.
Using the isolation tester to test locking seems like a good idea. We
can include it in test_radixtree. But given that the locking in the
radix tree is very simple, the test case would be very simple. It may
be controversial whether it's worth adding such testing by adding both
the new test module and test cases.
I mean that the isolation tester (or something else) would contain test
cases. I didn't mean to imply redundant testing.
I think the user (e.g, vacuumlazy.c) can pass the maximum offset
number to the parallel vacuum.
Okay, sounds good.
Most of v23's cleanups/fixes in the radix template look good to me,
although I didn't read the debugging code very closely. There is one
exception:
0006 - I've never heard of memset'ing a variable to avoid "variable unused"
compiler warnings, and it seems strange. It turns out we don't actually
need this variable in the first place. The attached .txt patch removes the
local variable and just writes to the passed pointer. This required callers
to initialize a couple of their own variables, but only child pointers, at
least on gcc 12. And I will work later on making "value" in the public API
a pointer.
0017 - I haven't taken a close look at the new changes, but I did notice
this some time ago:
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return sizeof(TidStore) + sizeof(TidStore) +
+ local_rt_memory_usage(ts->tree.local);
There is repetition in the else branch.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
Attachment: remove-intermediate-variables.txt (text/plain)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 542daae6d0..c2ee7f4fa1 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1618,7 +1618,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
/* Descend the tree until we reach a leaf node */
while (shift >= 0)
{
- RT_PTR_ALLOC new_child;
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
child = RT_PTR_GET_LOCAL(tree, stored_child);
@@ -1678,7 +1678,7 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
/* Descend the tree until a leaf node */
while (shift >= 0)
{
- RT_PTR_ALLOC child;
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
if (RT_NODE_IS_LEAF(node))
break;
@@ -1742,7 +1742,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
level = -1;
while (shift > 0)
{
- RT_PTR_ALLOC child;
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
/* Push the current node to the stack */
stack[++level] = allocnode;
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
index a319c46c39..c8410e9a5c 100644
--- a/src/include/lib/radixtree_search_impl.h
+++ b/src/include/lib/radixtree_search_impl.h
@@ -15,13 +15,11 @@
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- RT_VALUE_TYPE value;
- MemSet(&value, 0, sizeof(RT_VALUE_TYPE));
-
+ Assert(value_p != NULL);
Assert(RT_NODE_IS_LEAF(node));
#else
#ifndef RT_ACTION_UPDATE
- RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+ Assert(child_p != NULL);
#endif
Assert(!RT_NODE_IS_LEAF(node));
#endif
@@ -41,9 +39,9 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = n3->values[idx];
+ *value_p = n3->values[idx];
#else
- child = n3->children[idx];
+ *child_p = n3->children[idx];
#endif
#endif /* RT_ACTION_UPDATE */
break;
@@ -61,9 +59,9 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = n32->values[idx];
+ *value_p = n32->values[idx];
#else
- child = n32->children[idx];
+ *child_p = n32->children[idx];
#endif
#endif /* RT_ACTION_UPDATE */
break;
@@ -81,9 +79,9 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
#else
- child = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
#endif
#endif /* RT_ACTION_UPDATE */
break;
@@ -103,9 +101,9 @@
return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
#else
- child = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
#endif
#endif /* RT_ACTION_UPDATE */
break;
@@ -115,14 +113,6 @@
#ifdef RT_ACTION_UPDATE
return;
#else
-#ifdef RT_NODE_LEVEL_LEAF
- Assert(value_p != NULL);
- *value_p = value;
-#else
- Assert(child_p != NULL);
- *child_p = child;
-#endif
-
return true;
#endif /* RT_ACTION_UPDATE */
On Sat, Jan 28, 2023 at 8:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Jan 26, 2023 at 3:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Jan 26, 2023 at 3:54 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I think that we need to prevent concurrent updates (RT_SET() and
RT_DELETE()) during the iteration to get the consistent result through
the whole iteration operation. Unlike other operations such as
RT_SET(), we cannot expect that a job doing something for each
key-value pair in the radix tree completes in a short time, so we
cannot keep holding the radix tree lock until the end of the
iteration.
This sounds like a performance concern, rather than a correctness concern, is that right? If so, I don't think we should worry too much about optimizing simple locking, because it will *never* be fast enough for highly-concurrent read-write workloads anyway, and anyone interested in those workloads will have to completely replace the locking scheme, possibly using one of the ideas in the last ART paper you mentioned.
The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much as possible anyway.
Yes, but if a concurrent writer waits for another process to finish
the iteration, it ends up waiting on a lwlock, which is not
interruptible.
So the idea is that we set iter_active to true (with the
lock in exclusive mode), and prevent concurrent updates when the flag
is true.
...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.
Right. I think if we want to wait rather than an ERROR, the waiter
should wait in an interruptible way, for example, a condition
variable. I did a simpler way in the v22 patch.
...but looking at dshash.c, dshash_seq_next() seems to return an entry
while holding a lwlock on the partition. My assumption might be wrong.
Since there were calls to LWLockAcquire/Release in the last version, I'm a bit confused by this. Perhaps for the next patch, the email should contain a few sentences describing how locking is intended to work, including for iteration.
The lock I'm thinking of adding is a simple readers-writer lock. This
lock is used for concurrent radix tree operations except for the
iteration. For operations concurrent to the iteration, I used a flag
for the reason I mentioned above.
This doesn't tell me anything -- we already agreed on "simple reader-writer lock", months ago I believe. And I only have a vague idea about the tradeoffs made regarding iteration.
+ * WIP: describe about how locking works.
A first draft of what is intended for this WIP would be a good start. This WIP is from v23-0016, which contains no comments and a one-line commit message. I'd rather not try closely studying that patch (or how it works with 0011) until I have a clearer understanding of what requirements are assumed, what trade-offs are considered, and how it should be tested.
[thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameter for whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriate lock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note, I'm just guessing here, and I don't want to make things more difficult for future improvements.
Seems a good idea. Given the use case for parallel heap vacuum, it
would be a good idea to support having multiple read-only iterators. The
iteration of the v22 is read-only, so if we want to support read-write
iterator, we would need to support a function that modifies the
current key-value returned by the iteration.
Hmm, I wonder if we need to use the isolation tester. It's both a blessing and a curse that the first client of this data structure is tid lookup. It's a blessing because it doesn't present a highly-concurrent workload mixing reads and writes and so simple locking is adequate. It's a curse because to test locking and have any chance of finding bugs, we can't rely on vacuum to tell us that because (as you've said) it might very well work fine with no locking at all. So we must come up with test cases ourselves.
Using the isolation tester to test locking seems like a good idea. We
can include it in test_radixtree. But given that the locking in the
radix tree is very simple, the test case would be very simple. It may
be controversial whether it's worth adding such testing by adding both
the new test module and test cases.
I mean that the isolation tester (or something else) would contain test cases. I didn't mean to imply redundant testing.
Okay, understood.
I think the user (e.g, vacuumlazy.c) can pass the maximum offset
number to the parallel vacuum.
Okay, sounds good.
Most of v23's cleanups/fixes in the radix template look good to me, although I didn't read the debugging code very closely. There is one exception:
0006 - I've never heard of memset'ing a variable to avoid "variable unused" compiler warnings, and it seems strange. It turns out we don't actually need this variable in the first place. The attached .txt patch removes the local variable and just writes to the passed pointer. This required callers to initialize a couple of their own variables, but only child pointers, at least on gcc 12.
Agreed with the attached patch.
And I will work later on making "value" in the public API a pointer.
Thanks!
0017 - I haven't taken a close look at the new changes, but I did notice this some time ago:
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return sizeof(TidStore) + sizeof(TidStore) +
+ local_rt_memory_usage(ts->tree.local);
There is repetition in the else branch.
Agreed, will remove.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 26, 2023 at 12:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Tue, Jan 24, 2023 at 1:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Jan 23, 2023 at 6:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Attached is a rebase to fix conflicts from recent commits.
I have reviewed v22-0022* patch and I have some comments.
1.
It also changes to the column names max_dead_tuples and num_dead_tuples and to
show the progress information in bytes.
I think this statement needs to be rephrased.
Could you be more specific?
I mean the below statement in the commit message doesn't look
grammatically correct to me.
"It also changes to the column names max_dead_tuples and
num_dead_tuples and to show the progress information in bytes."
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sun, Jan 29, 2023 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Sat, Jan 28, 2023 at 8:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
The first implementation should be simple, easy to test/verify, easy to
understand, and easy to replace. As much as possible anyway.
Yes, but if a concurrent writer waits for another process to finish
the iteration, it ends up waiting on a lwlock, which is not
interruptible.
So the idea is that we set iter_active to true (with the
lock in exclusive mode), and prevent concurrent updates when the flag
is true.
...by throwing elog(ERROR)? I'm not so sure users of this API would
prefer that to waiting.
Right. I think if we want to wait rather than an ERROR, the waiter
should wait in an interruptible way, for example, a condition
variable. I did a simpler way in the v22 patch.
...but looking at dshash.c, dshash_seq_next() seems to return an entry
while holding a lwlock on the partition. My assumption might be wrong.
Using partitions there makes holding a lock less painful on average, I
imagine, but I don't know the details there.
If we make it clear that the first committed version is not (yet) designed
for high concurrency with mixed read-write workloads, I think waiting (as a
protocol) is fine. If waiting is a problem for some use case, at that point
we should just go all the way and replace the locking entirely. In fact, it
might be good to spell this out in the top-level comment and include a link
to the second ART paper.
[thinks some more...] Is there an API-level assumption that hasn't been
spelled out? Would it help to have a parameter for whether the iteration
function wants to reserve the privilege to perform writes? It could take
the appropriate lock at the start, and there could then be multiple
read-only iterators, but only one read/write iterator. Note, I'm just
guessing here, and I don't want to make things more difficult for future
improvements.
Seems a good idea. Given the use case for parallel heap vacuum, it
would be a good idea to support having multiple read-only iterators. The
iteration of the v22 is read-only, so if we want to support read-write
iterator, we would need to support a function that modifies the
current key-value returned by the iteration.
Okay, so updating during iteration is not currently supported. It could in
the future, but I'd say that can also wait for fine-grained concurrency
support. Intermediate-term, we should at least make it straightforward to
support:
1) parallel heap vacuum -> multiple read-only iterators
2) parallel heap pruning -> multiple writers
It may or may not be worth it for someone to actually start either of those
projects, and there are other ways to improve vacuum that may be more
pressing. That said, it seems the tid store with global locking would
certainly work fine for #1 and maybe "not too bad" for #2. #2 can also
mitigate waiting by using larger batching, or the leader process could
"pre-warm" the tid store with zero-values using block numbers from the
visibility map.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Jan 30, 2023 at 1:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Thu, Jan 26, 2023 at 12:39 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Tue, Jan 24, 2023 at 1:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Mon, Jan 23, 2023 at 6:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
Attached is a rebase to fix conflicts from recent commits.
I have reviewed v22-0022* patch and I have some comments.
1.
It also changes to the column names max_dead_tuples and num_dead_tuples and to
show the progress information in bytes.
I think this statement needs to be rephrased.
Could you be more specific?
I mean the below statement in the commit message doesn't look
grammatically correct to me."It also changes to the column names max_dead_tuples and
num_dead_tuples and to show the progress information in bytes."
I've changed the commit message in the v23 patch. Please check it.
Other comments are also incorporated in the v23 patch. Thank you for
the comments!
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Jan 30, 2023 at 1:31 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Sun, Jan 29, 2023 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Jan 28, 2023 at 8:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much as possible anyway.
Yes, but if a concurrent writer waits for another process to finish
the iteration, it ends up waiting on a lwlock, which is not
interruptible.
So the idea is that we set iter_active to true (with the
lock in exclusive mode), and prevent concurrent updates when the flag
is true.
...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.
Right. I think if we want to wait rather than an ERROR, the waiter
should wait in an interruptible way, for example, a condition
variable. I did a simpler way in the v22 patch.
...but looking at dshash.c, dshash_seq_next() seems to return an entry
while holding a lwlock on the partition. My assumption might be wrong.
Using partitions there makes holding a lock less painful on average, I imagine, but I don't know the details there.
If we make it clear that the first committed version is not (yet) designed for high concurrency with mixed read-write workloads, I think waiting (as a protocol) is fine. If waiting is a problem for some use case, at that point we should just go all the way and replace the locking entirely. In fact, it might be good to spell this out in the top-level comment and include a link to the second ART paper.
Agreed. Will update the comments.
[thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameter for whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriate lock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note, I'm just guessing here, and I don't want to make things more difficult for future improvements.
Seems like a good idea. Given the use case of parallel heap vacuum, it
would be good to support having multiple read-only iterators. The
iteration in the v22 patch is read-only, so if we want to support a
read-write iterator, we would need a function that modifies the
current key-value pair returned by the iteration.
Okay, so updating during iteration is not currently supported. It could be in the future, but I'd say that can also wait for fine-grained concurrency support. Intermediate-term, we should at least make it straightforward to support:
1) parallel heap vacuum -> multiple read-only iterators
2) parallel heap pruning -> multiple writers
It may or may not be worth it for someone to actually start either of those projects, and there are other ways to improve vacuum that may be more pressing. That said, it seems the tid store with global locking would certainly work fine for #1 and maybe "not too bad" for #2. #2 can also mitigate waiting by using larger batching, or the leader process could "pre-warm" the tid store with zero-values using block numbers from the visibility map.
True. Using a larger batching method seems to be worth testing when we
implement the parallel heap pruning.
In the next version of the patch, I'm going to update the locking support
part and incorporate the other comments I got.
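For illustration only, the iterator API parameter described above might look
something like the following (hypothetical; the v22/v24 patches only take the
shared lock, and the "exclusive" field does not exist there):

RT_SCOPE RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE *tree, bool exclusive)
{
	RT_ITER    *iter;

	iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
	iter->tree = tree;
	iter->exclusive = exclusive;	/* hypothetical field, e.g. for assertions */

	if (exclusive)
		RT_LOCK_EXCLUSIVE(tree);	/* at most one read-write iterator */
	else
		RT_LOCK_SHARED(tree);		/* any number of read-only iterators */

	/* ... set up the iteration stack as the current patch does ... */

	return iter;
}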
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Jan 30, 2023 at 11:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Jan 30, 2023 at 1:31 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Sun, Jan 29, 2023 at 9:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Jan 28, 2023 at 8:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
The first implementation should be simple, easy to test/verify, easy to understand, and easy to replace. As much as possible anyway.
Yes, but if a concurrent writer waits for another process to finish
the iteration, it ends up waiting on a lwlock, which is not
interruptible.
So the idea is that we set iter_active to true (with the
lock in exclusive mode), and prevent concurrent updates when the flag
is true.
...by throwing elog(ERROR)? I'm not so sure users of this API would prefer that to waiting.
Right. I think if we want to wait rather than an ERROR, the waiter
should wait in an interruptible way, for example, a condition
variable. I took a simpler approach in the v22 patch.
...but looking at dshash.c, dshash_seq_next() seems to return an entry
while holding a lwlock on the partition. My assumption might be wrong.
Using partitions there makes holding a lock less painful on average, I imagine, but I don't know the details there.
If we make it clear that the first committed version is not (yet) designed for high concurrency with mixed read-write workloads, I think waiting (as a protocol) is fine. If waiting is a problem for some use case, at that point we should just go all the way and replace the locking entirely. In fact, it might be good to spell this out in the top-level comment and include a link to the second ART paper.
Agreed. Will update the comments.
[thinks some more...] Is there an API-level assumption that hasn't been spelled out? Would it help to have a parameter for whether the iteration function wants to reserve the privilege to perform writes? It could take the appropriate lock at the start, and there could then be multiple read-only iterators, but only one read/write iterator. Note, I'm just guessing here, and I don't want to make things more difficult for future improvements.
Seems like a good idea. Given the use case of parallel heap vacuum, it
would be good to support having multiple read-only iterators. The
iteration in the v22 patch is read-only, so if we want to support a
read-write iterator, we would need a function that modifies the
current key-value pair returned by the iteration.
Okay, so updating during iteration is not currently supported. It could be in the future, but I'd say that can also wait for fine-grained concurrency support. Intermediate-term, we should at least make it straightforward to support:
1) parallel heap vacuum -> multiple read-only iterators
2) parallel heap pruning -> multiple writers
It may or may not be worth it for someone to actually start either of those projects, and there are other ways to improve vacuum that may be more pressing. That said, it seems the tid store with global locking would certainly work fine for #1 and maybe "not too bad" for #2. #2 can also mitigate waiting by using larger batching, or the leader process could "pre-warm" the tid store with zero-values using block numbers from the visibility map.
True. Using a larger batching method seems to be worth testing when we
implement the parallel heap pruning.
In the next version of the patch, I'm going to update the locking support part and incorporate the other comments I got.
I've attached the v24 patches. The locking support patch is separated
out (as the 0005 patch). I've also kept the updates to TidStore and the
vacuum integration from v23 as separate patches.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v24-0005-Add-read-write-lock-to-radix-tree-in-RT_SHMEM-ca.patch
From 1085ef0b9b8b31795616abc43063a91b27e7d5a4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 25 Jan 2023 17:43:29 +0900
Subject: [PATCH v24 5/9] Add read-write lock to radix tree in RT_SHMEM case.
---
src/include/lib/radixtree.h | 102 ++++++++++++++++--
.../modules/test_radixtree/test_radixtree.c | 8 +-
2 files changed, 100 insertions(+), 10 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index f591d903fc..48134b10e4 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -40,6 +40,18 @@
* There are some optimizations not yet implemented, particularly path
* compression and lazy path expansion.
*
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
* WIP: the radix tree nodes don't shrink.
*
* To generate a radix tree and associated functions for a use case several
@@ -224,7 +236,7 @@ typedef dsa_pointer RT_HANDLE;
#endif
#ifdef RT_SHMEM
-RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
@@ -371,6 +383,16 @@ typedef struct RT_NODE
#define RT_INVALID_PTR_ALLOC NULL
#endif
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
/*
* Inner nodes and leaf nodes have analogous structure. To distinguish
* them at runtime, we take advantage of the fact that the key chunk
@@ -596,6 +618,7 @@ typedef struct RT_RADIX_TREE_CONTROL
#ifdef RT_SHMEM
RT_HANDLE handle;
uint32 magic;
+ LWLock lock;
#endif
RT_PTR_ALLOC root;
@@ -1376,7 +1399,7 @@ RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC store
*/
RT_SCOPE RT_RADIX_TREE *
#ifdef RT_SHMEM
-RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
#else
RT_CREATE(MemoryContext ctx)
#endif
@@ -1398,6 +1421,7 @@ RT_CREATE(MemoryContext ctx)
tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
tree->ctl->handle = dp;
tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
#else
tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
@@ -1581,6 +1605,8 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
+ RT_LOCK_EXCLUSIVE(tree);
+
/* Empty tree, create the root */
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
RT_NEW_ROOT(tree, key);
@@ -1606,6 +1632,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
{
RT_SET_EXTEND(tree, key, value, parent, stored_child, child);
+ RT_UNLOCK(tree);
return false;
}
@@ -1620,12 +1647,13 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
if (!updated)
tree->ctl->num_keys++;
+ RT_UNLOCK(tree);
return updated;
}
/*
* Search the given key in the radix tree. Return true if there is the key,
- * otherwise return false. On success, we set the value to *val_p so it must
+ * otherwise return false. On success, we set the value to *val_p so it must
* not be NULL.
*/
RT_SCOPE bool
@@ -1633,14 +1661,20 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
{
RT_PTR_LOCAL node;
int shift;
+ bool found;
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
Assert(value_p != NULL);
+ RT_LOCK_SHARED(tree);
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
return false;
+ }
node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
shift = node->shift;
@@ -1654,13 +1688,19 @@ RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
break;
if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
return false;
+ }
node = RT_PTR_GET_LOCAL(tree, child);
shift -= RT_NODE_SPAN;
}
- return RT_NODE_SEARCH_LEAF(node, key, value_p);
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
}
#ifdef RT_USE_DELETE
@@ -1682,8 +1722,13 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
#endif
+ RT_LOCK_EXCLUSIVE(tree);
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
return false;
+ }
/*
* Descend the tree to search the key while building a stack of nodes we
@@ -1702,7 +1747,10 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
node = RT_PTR_GET_LOCAL(tree, allocnode);
if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
return false;
+ }
allocnode = child;
shift -= RT_NODE_SPAN;
@@ -1715,6 +1763,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
if (!deleted)
{
/* no key is found in the leaf node */
+ RT_UNLOCK(tree);
return false;
}
@@ -1726,7 +1775,10 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
* node.
*/
if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
return true;
+ }
/* Free the empty leaf node */
RT_FREE_NODE(tree, allocnode);
@@ -1748,6 +1800,7 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
RT_FREE_NODE(tree, allocnode);
}
+ RT_UNLOCK(tree);
return true;
}
#endif
@@ -1812,7 +1865,12 @@ RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
}
}
-/* Create and return the iterator for the given radix tree */
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
RT_SCOPE RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
{
@@ -1826,6 +1884,8 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
iter->tree = tree;
+ RT_LOCK_SHARED(tree);
+
/* empty tree */
if (!iter->tree->ctl->root)
return iter;
@@ -1846,7 +1906,7 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
}
/*
- * Return true with setting key_p and value_p if there is next key. Otherwise,
+ * Return true with setting key_p and value_p if there is next key. Otherwise
* return false.
*/
RT_SCOPE bool
@@ -1901,9 +1961,20 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
return false;
}
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function needs to be called after finishing or when exiting an
+ * iteration.
+ */
RT_SCOPE void
RT_END_ITERATE(RT_ITER *iter)
{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
pfree(iter);
}
@@ -1915,6 +1986,8 @@ RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
Size total = 0;
+ RT_LOCK_SHARED(tree);
+
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
total = dsa_get_total_size(tree->dsa);
@@ -1926,6 +1999,7 @@ RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
}
#endif
+ RT_UNLOCK(tree);
return total;
}
@@ -2010,6 +2084,8 @@ RT_VERIFY_NODE(RT_PTR_LOCAL node)
RT_SCOPE void
RT_STATS(RT_RADIX_TREE *tree)
{
+ RT_LOCK_SHARED(tree);
+
fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
@@ -2029,6 +2105,8 @@ RT_STATS(RT_RADIX_TREE *tree)
tree->ctl->cnt[RT_CLASS_125],
tree->ctl->cnt[RT_CLASS_256]);
}
+
+ RT_UNLOCK(tree);
}
static void
@@ -2222,14 +2300,18 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
RT_STATS(tree);
+ RT_LOCK_SHARED(tree);
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
{
+ RT_UNLOCK(tree);
fprintf(stderr, "empty tree\n");
return;
}
if (key > tree->ctl->max_val)
{
+ RT_UNLOCK(tree);
fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
key, key);
return;
@@ -2263,6 +2345,7 @@ RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
shift -= RT_NODE_SPAN;
level++;
}
+ RT_UNLOCK(tree);
fprintf(stderr, "%s", buf.data);
}
@@ -2274,8 +2357,11 @@ RT_DUMP(RT_RADIX_TREE *tree)
RT_STATS(tree);
+ RT_LOCK_SHARED(tree);
+
if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
{
+ RT_UNLOCK(tree);
fprintf(stderr, "empty tree\n");
return;
}
@@ -2283,6 +2369,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
initStringInfo(&buf);
RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
fprintf(stderr, "%s",buf.data);
}
@@ -2310,6 +2397,9 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_GET_KEY_CHUNK
#undef BM_IDX
#undef BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
#undef RT_NODE_IS_LEAF
#undef RT_NODE_MUST_GROW
#undef RT_NODE_KIND_COUNT
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 2a93e731ae..bbe1a619b6 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -144,7 +144,7 @@ test_empty(void)
dsa_area *dsa;
dsa = dsa_create(tranche_id);
- radixtree = rt_create(CurrentMemoryContext, dsa);
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
#else
radixtree = rt_create(CurrentMemoryContext);
#endif
@@ -195,7 +195,7 @@ test_basic(int children, bool test_inner)
test_inner ? "inner" : "leaf", children);
#ifdef RT_SHMEM
- radixtree = rt_create(CurrentMemoryContext, dsa);
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
#else
radixtree = rt_create(CurrentMemoryContext);
#endif
@@ -363,7 +363,7 @@ test_node_types(uint8 shift)
elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
#ifdef RT_SHMEM
- radixtree = rt_create(CurrentMemoryContext, dsa);
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
#else
radixtree = rt_create(CurrentMemoryContext);
#endif
@@ -434,7 +434,7 @@ test_pattern(const test_spec * spec)
MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
#ifdef RT_SHMEM
- radixtree = rt_create(radixtree_ctx, dsa);
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
#else
radixtree = rt_create(radixtree_ctx);
#endif
--
2.31.1
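As a usage note (a sketch, not code from the patch set; the tranche name is made
up), a caller creating a shared radix tree is now expected to supply an LWLock
tranche in addition to the DSA area, roughly:

	int			tranche_id = LWLockNewTrancheId();
	dsa_area   *dsa;
	rt_radix_tree *radixtree;

	LWLockRegisterTranche(tranche_id, "my_radix_tree");
	dsa = dsa_create(tranche_id);
	radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);

	/* rt_set/rt_search take the tree's lock internally */
	rt_set(radixtree, UINT64CONST(42), (uint64) 1);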
v24-0008-Update-TidStore-patch-from-v23.patch
From c76104ba85a5668cfbcb236610bc494127642102 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 31 Jan 2023 17:41:31 +0900
Subject: [PATCH v24 8/9] Update TidStore patch from v23.
Incorporate the comments, update comments, and add the description of
concurrency support.
---
src/backend/access/common/tidstore.c | 110 +++++++++++++++------------
1 file changed, 62 insertions(+), 48 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 89aea71945..f656de2189 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -11,7 +11,10 @@
* to tidstore_create(). Other backends can attach to the shared TidStore by
* tidstore_attach().
*
- * XXX: Only one process is allowed to iterate over the TidStore at a time.
+ * Regarding concurrency, we basically rely on the concurrency support in
+ * the radix tree, but we acquire the lock on a TidStore in some cases, for
+ * example, when resetting the store and when accessing the number of tids
+ * in the store (num_tids).
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -23,7 +26,6 @@
*/
#include "postgres.h"
-#include "access/htup_details.h"
#include "access/tidstore.h"
#include "miscadmin.h"
#include "port/pg_bitutils.h"
@@ -87,14 +89,17 @@
#define RT_VALUE_TYPE uint64
#include "lib/radixtree.h"
-/* The header object for a TidStore */
+/* The control object for a TidStore */
typedef struct TidStoreControl
{
- int64 num_tids; /* the number of Tids stored so far */
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
size_t max_bytes; /* the maximum bytes a TidStore can use */
int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for max_offset */
bool encode_tids; /* do we use tid encoding? */
- int offset_nbits; /* the number of bits used for offset number */
int offset_key_nbits; /* the number of bits of a offset number
* used for the key */
@@ -117,7 +122,7 @@ struct TidStore
*/
TidStoreControl *control;
- /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
union
{
local_rt_radix_tree *local;
@@ -170,24 +175,24 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
/*
* Create the radix tree for the main storage.
*
- * Memory consumption depends on the number of Tids stored, but also on the
+ * Memory consumption depends on the number of stored tids, but also on the
* distribution of them, how the radix tree stores, and the memory management
* that backed the radix tree. The maximum bytes that a TidStore can
* use is specified by the max_bytes in tidstore_create(). We want the total
- * amount of memory consumption not to exceed the max_bytes.
+ * amount of memory consumption by a TidStore not to exceed the max_bytes.
*
- * In non-shared cases, the radix tree uses slab allocators for each kind of
- * node class. The most memory consuming case while adding Tids associated
- * with one page (i.e. during tidstore_add_tids()) is that we allocate the
- * largest radix tree node in a new slab block, which is approximately 70kB.
- * Therefore, we deduct 70kB from the maximum bytes.
+ * In local TidStore cases, the radix tree uses slab allocators for each kind
+ * of node class. The most memory consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is that we allocate a new
+ * slab block for a new radix tree node, which is approximately 70kB. Therefore,
+ * we deduct 70kB from the max_bytes.
*
* In shared cases, DSA allocates the memory segments big enough to follow
* a geometric series that approximately doubles the total DSA size (see
* make_new_segment() in dsa.c). We simulated the how DSA increases segment
* size and the simulation revealed, the 75% threshold for the maximum bytes
- * perfectly works in case where it is a power-of-2, and the 60% threshold
- * works for other cases.
+ * perfectly works in case where the max_bytes is a power-of-2, and the 60%
+ * threshold works for other cases.
*/
if (area != NULL)
{
@@ -199,7 +204,7 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
dp = dsa_allocate0(area, sizeof(TidStoreControl));
ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes =(uint64) (max_bytes * ratio);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
ts->area = area;
ts->control->magic = TIDSTORE_MAGIC;
@@ -212,12 +217,16 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
ts->tree.local = local_rt_create(CurrentMemoryContext);
ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
- ts->control->max_bytes = max_bytes - (1024 * 70);
+ ts->control->max_bytes = max_bytes - (70 * 1024);
}
ts->control->max_offset = max_offset;
ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+ /*
+ * We use tid encoding if the number of bits required for the offset number
+ * doesn't fit in a uint64 value.
+ */
if (ts->control->offset_nbits > TIDSTORE_VALUE_NBITS)
{
ts->control->encode_tids = true;
@@ -311,7 +320,10 @@ tidstore_destroy(TidStore *ts)
pfree(ts);
}
-/* Forget all collected Tids */
+/*
+ * Forget all collected Tids. It's similar to tidstore_destroy but we don't
+ * free the entire TidStore; we only recreate the radix tree storage.
+ */
void
tidstore_reset(TidStore *ts)
{
@@ -350,15 +362,6 @@ tidstore_reset(TidStore *ts)
}
}
-static inline void
-tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
-{
- if (TidStoreIsShared(ts))
- shared_rt_set(ts->tree.shared, key, val);
- else
- local_rt_set(ts->tree.local, key, val);
-}
-
/* Add Tids on a block to TidStore */
void
tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
@@ -371,8 +374,6 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- ItemPointerSetBlockNumber(&tid, blkno);
-
if (ts->control->encode_tids)
{
key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
@@ -383,9 +384,9 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
key_base = (uint64) blkno;
nkeys = 1;
}
-
values = palloc0(sizeof(uint64) * nkeys);
+ ItemPointerSetBlockNumber(&tid, blkno);
for (int i = 0; i < num_offsets; i++)
{
uint64 key;
@@ -413,7 +414,10 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
{
uint64 key = key_base + i;
- tidstore_insert_kv(ts, key, values[i]);
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, values[i]);
+ else
+ local_rt_set(ts->tree.local, key, values[i]);
}
}
@@ -449,8 +453,11 @@ tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
}
/*
- * Prepare to iterate through a TidStore. The caller must be certain that
- * no other backend will attempt to update the TidStore during the iteration.
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value to the radix tree.
*/
TidStoreIter *
tidstore_begin_iterate(TidStore *ts)
@@ -482,13 +489,14 @@ tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
{
if (TidStoreIsShared(iter->ts))
return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
- else
- return local_rt_iterate_next(iter->tree_iter.local, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
}
/*
- * Scan the TidStore and return a TidStoreIterResult representing Tids
- * in one page. Offset numbers in the result is sorted.
+ * Scan the TidStore and return a pointer to TidStoreIterResult that has tids
+ * in one block. We return the block numbers in ascending order and the offset
+ * numbers in each result are also sorted in ascending order.
*/
TidStoreIterResult *
tidstore_iterate_next(TidStoreIter *iter)
@@ -502,6 +510,7 @@ tidstore_iterate_next(TidStoreIter *iter)
if (BlockNumberIsValid(result->blkno))
{
+ /* Process the previously collected key-value */
result->num_offsets = 0;
tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
}
@@ -515,8 +524,8 @@ tidstore_iterate_next(TidStoreIter *iter)
if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
{
/*
- * Remember the key-value pair for the next block for the
- * next iteration.
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
*/
iter->next_key = key;
iter->next_val = val;
@@ -531,7 +540,10 @@ tidstore_iterate_next(TidStoreIter *iter)
return result;
}
-/* Finish an iteration over TidStore */
+/*
+ * Finish an iteration over TidStore. This needs to be called after finishing
+ * or when exiting an iteration.
+ */
void
tidstore_end_iterate(TidStoreIter *iter)
{
@@ -544,7 +556,7 @@ tidstore_end_iterate(TidStoreIter *iter)
pfree(iter);
}
-/* Return the number of Tids we collected so far */
+/* Return the number of tids we collected so far */
int64
tidstore_num_tids(TidStore *ts)
{
@@ -552,7 +564,7 @@ tidstore_num_tids(TidStore *ts)
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- if (TidStoreIsShared(ts))
+ if (!TidStoreIsShared(ts))
return ts->control->num_tids;
LWLockAcquire(&ts->control->lock, LW_SHARED);
@@ -593,9 +605,8 @@ tidstore_memory_usage(TidStore *ts)
*/
if (TidStoreIsShared(ts))
return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
- else
- return sizeof(TidStore) + sizeof(TidStore) +
- local_rt_memory_usage(ts->tree.local);
+
+ return sizeof(TidStore) + sizeof(TidStore) + local_rt_memory_usage(ts->tree.local);
}
/*
@@ -609,7 +620,7 @@ tidstore_get_handle(TidStore *ts)
return ts->control->handle;
}
-/* Extract Tids from the given key-value pair */
+/* Extract tids from the given key-value pair */
static void
tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
{
@@ -621,7 +632,10 @@ tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
OffsetNumber off;
if (i > iter->ts->control->max_offset)
+ {
+ Assert(!iter->ts->control->encode_tids);
break;
+ }
if ((val & (UINT64CONST(1) << i)) == 0)
continue;
@@ -644,8 +658,8 @@ key_get_blkno(TidStore *ts, uint64 key)
{
if (ts->control->encode_tids)
return (BlockNumber) (key >> ts->control->offset_key_nbits);
- else
- return (BlockNumber) key;
+
+ return (BlockNumber) key;
}
/* Encode a tid to key and offset */
--
2.31.1
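For reference, the iteration protocol described in these comments is used like
the following condensed sketch of what lazy_vacuum_heap_rel() does in the 0009
patch (process_block() is just a placeholder here):

	TidStoreIter *iter;
	TidStoreIterResult *result;

	iter = tidstore_begin_iterate(dead_items);	/* locks the underlying radix tree */
	while ((result = tidstore_iterate_next(iter)) != NULL)
	{
		/* blocks come back in ascending order; offsets are sorted per block */
		process_block(result->blkno, result->offsets, result->num_offsets);
	}
	tidstore_end_iterate(iter);					/* releases the lock */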
v24-0009-Update-vacuum-integration-patch-from-v23.patch
From fd380a199f38545a56d7fa11c45ec088d62389f4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 31 Jan 2023 22:44:40 +0900
Subject: [PATCH v24 9/9] Update vacuum integration patch from v23.
---
src/backend/access/heap/vacuumlazy.c | 64 +++++++++++++--------------
src/backend/commands/vacuumparallel.c | 11 +++--
2 files changed, 37 insertions(+), 38 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3537df16fd..b4e40423a8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3,18 +3,18 @@
* vacuumlazy.c
* Concurrent ("lazy") vacuuming.
*
- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TidStore, a storage for dead TIDs
* that are to be removed from indexes. We want to ensure we can vacuum even
* the very largest relations with finite memory space usage. To do that, we
- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables). If the array threatens to overflow, we must call
- * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
- * This frees up the memory space dedicated to storing dead TIDs.
+ * create a TidStore with the maximum bytes that can be used by the TidStore.
+ * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
+ * vacuum the pages that we've pruned). This frees up the memory space dedicated
+ * to storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -492,11 +492,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/*
- * Allocate dead_items array memory using dead_items_alloc. This handles
- * parallel VACUUM initialization as part of allocating shared memory
- * space used for dead_items. (But do a failsafe precheck first, to
- * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
- * is already dangerously old.)
+ * Allocate dead_items memory using dead_items_alloc. This handles parallel
+ * VACUUM initialization as part of allocating shared memory space used for
+ * dead_items. (But do a failsafe precheck first, to ensure that parallel
+ * VACUUM won't be attempted at all when relfrozenxid is already dangerously
+ * old.)
*/
lazy_check_wraparound_failsafe(vacrel);
dead_items_alloc(vacrel, params->nworkers);
@@ -802,7 +802,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* have collected the TIDs whose index tuples need to be removed.
*
* Finally, invokes lazy_vacuum_heap_rel to vacuum heap pages, which
- * largely consists of marking LP_DEAD items (from collected TID array)
+ * largely consists of marking LP_DEAD items (from vacrel->dead_items)
* as LP_UNUSED. This has to happen in a second, final pass over the
* heap, to preserve a basic invariant that all index AMs rely on: no
* extant index tuple can ever be allowed to contain a TID that points to
@@ -973,7 +973,7 @@ lazy_scan_heap(LVRelState *vacrel)
continue;
}
- /* Collect LP_DEAD items in dead_items array, count tuples */
+ /* Collect LP_DEAD items in dead_items, count tuples */
if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
&recordfreespace))
{
@@ -1015,10 +1015,10 @@ lazy_scan_heap(LVRelState *vacrel)
* Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
- * dead_items array. This includes LP_DEAD line pointers that we
- * pruned ourselves, as well as existing LP_DEAD line pointers that
- * were pruned some time earlier. Also considers freezing XIDs in the
- * tuple headers of remaining items with storage.
+ * dead_items. This includes LP_DEAD line pointers that we pruned
+ * ourselves, as well as existing LP_DEAD line pointers that were pruned
+ * some time earlier. Also considers freezing XIDs in the tuple headers
+ * of remaining items with storage.
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
@@ -1084,7 +1084,7 @@ lazy_scan_heap(LVRelState *vacrel)
}
else if (prunestate.num_offsets > 0)
{
- /* Save details of the LP_DEAD items from the page */
+ /* Save details of the LP_DEAD items from the page in dead_items */
tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
prunestate.num_offsets);
@@ -1535,9 +1535,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
* The approach we take now is to restart pruning when the race condition is
* detected. This allows heap_page_prune() to prune the tuples inserted by
* the now-aborted transaction. This is a little crude, but it guarantees
- * that any items that make it into the dead_items array are simple LP_DEAD
- * line pointers, and that every remaining item with tuple storage is
- * considered as a candidate for freezing.
+ * that any items that make it into the dead_items are simple LP_DEAD line
+ * pointers, and that every remaining item with tuple storage is considered
+ * as a candidate for freezing.
*/
static void
lazy_scan_prune(LVRelState *vacrel,
@@ -1929,7 +1929,7 @@ retry:
* lazy_scan_prune, which requires a full cleanup lock. While pruning isn't
* performed here, it's quite possible that an earlier opportunistic pruning
* operation left LP_DEAD items behind. We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items for removal from indexes.
*
* For aggressive VACUUM callers, we may return false to indicate that a full
* cleanup lock is required for processing by lazy_scan_prune. This is only
@@ -2088,7 +2088,7 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
vacrel->NewRelminMxid = NoFreezePageRelminMxid;
- /* Save any LP_DEAD items found on the page in dead_items array */
+ /* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
@@ -2373,9 +2373,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
/*
* lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
*
- * This routine marks LP_DEAD items in vacrel->dead_items array as LP_UNUSED.
- * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
- * at all.
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
*
* We may also be able to truncate the line pointer array of the heap pages we
* visit. If there is a contiguous group of LP_UNUSED items at the end of the
@@ -2461,7 +2460,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
ereport(DEBUG2,
(errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
- vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
@@ -2660,8 +2660,8 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
* lazy_vacuum_one_index() -- vacuum index relation.
*
* Delete all the index tuples containing a TID collected in
- * vacrel->dead_items array. Also update running statistics.
- * Exact details depend on index AM's ambulkdelete routine.
+ * vacrel->dead_items. Also update running statistics. Exact
+ * details depend on index AM's ambulkdelete routine.
*
* reltuples is the number of heap tuples to be passed to the
* bulkdelete callback. It's always assumed to be estimated.
@@ -3067,8 +3067,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
}
/*
- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 5c7e6ed99c..d653683693 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,12 +9,11 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSM segment. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * the shared TidStore. We launch parallel worker processes at the start of
+ * parallel index bulk-deletion and index cleanup and once all indexes are
+ * processed, the parallel worker processes exit. Each time we process indexes
+ * in parallel, the parallel context is re-initialized so that the same DSM can
+ * be used for multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
--
2.31.1
v24-0007-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
From 850aff99cfddb2e77822d616248a4550cdae269c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 17 Jan 2023 17:20:37 +0700
Subject: [PATCH v24 7/9] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space efficient and was slow to look up. It also
had a 1GB limit on its size.
Now we use TIDStore to store dead tuple TIDs. Since the TIDStore,
backed by the radix tree, allocates memory incrementally, we get rid
of the 1GB limit.
Since we are no longer able to exactly estimate the maximum number of
TIDs that can be stored, pg_stat_progress_vacuum shows the progress
information based on the amount of memory in bytes. The column names
are also changed to max_dead_tuple_bytes and num_dead_tuple_bytes.
In addition, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by a TIDStore is 1MB, the initial
DSA segment size. Due to that, we increase the minimum value of
maintenance_work_mem (and autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 218 +++++++--------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-------
src/backend/commands/vacuumparallel.c | 62 +++---
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +--
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 142 insertions(+), 278 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d936aa3da3..0230c74e3d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6870,10 +6870,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6881,10 +6881,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..3537df16fd 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,11 +221,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -259,8 +263,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -825,21 +830,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +911,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1018,7 +1022,7 @@ lazy_scan_heap(LVRelState *vacrel)
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1038,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1080,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1156,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1204,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1260,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1543,13 +1554,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1580,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1588,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->deadoffsets; prunestate->deadoffsets's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1601,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1646,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1875,7 +1883,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1896,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1917,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -2129,8 +2118,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2127,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2179,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2208,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2235,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2281,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2354,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2410,10 +2391,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2410,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2420,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2434,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2445,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,36 +2455,30 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items), vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2497,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2571,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -3093,46 +3066,6 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
return vacrel->nonempty_pages;
}
-/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
/*
* Allocate dead_items (either using palloc, or in dynamic shared memory).
* Sets dead_items in vacrel for caller.
@@ -3143,11 +3076,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3105,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3118,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..a526e607fe 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127e..d8e680ca20 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,82 +2342,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..5c7e6ed99c 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index f5ea381c53..d88db3e1f8 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3397,12 +3397,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index c5a95f5dcc..a8e7041395 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2312,7 +2312,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..a3ebb169ef 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e7a2f5856a..f6ae02eb14 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
v24-0006-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
From 9bb09e2742c2c8aa21802697c33fb3357f7516d9 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v24 6/9] Add TIDStore, to store sets of TIDs (ItemPointerData)
efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by a radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
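As a rough usage sketch, based only on the functions declared in tidstore.h
in this patch (the block number, offsets, and error handling are made up for
illustration):

    TidStore   *ts;
    TidStoreIter *iter;
    TidStoreIterResult *result;
    OffsetNumber offs[3] = {1, 5, 12};
    ItemPointerData tid;

    /* backend-local store, limited to maintenance_work_mem bytes */
    ts = tidstore_create(maintenance_work_mem * 1024L, MaxHeapTuplesPerPage, NULL);

    /* remember three dead item offsets on block 42 */
    tidstore_add_tids(ts, (BlockNumber) 42, offs, 3);

    /* existence check, as index vacuuming would do for each index tuple */
    ItemPointerSet(&tid, 42, 5);
    if (!tidstore_lookup_tid(ts, &tid))
        elog(ERROR, "tid unexpectedly missing");

    /* iterate back block by block; offsets come back sorted per block */
    iter = tidstore_begin_iterate(ts);
    while ((result = tidstore_iterate_next(iter)) != NULL)
    {
        /* use result->blkno and result->offsets[0 .. num_offsets - 1] */
    }
    tidstore_end_iterate(iter);

    tidstore_destroy(ts);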
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 674 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 195 +++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1019 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1756f1a4b6..d936aa3da3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2192,6 +2192,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..89aea71945
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,674 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of a 64-bit key and a 64-bit value and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * using tidstore_attach().
+ *
+ * XXX: Only one process is allowed to iterate over the TidStore at a time.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of a 64-bit key and a
+ * 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is determined by max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys help keep the radix tree shallow.
+ *
+ * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ *
+ * If all possible offset numbers fit in the 64-bit value (i.e. offset_nbits is
+ * at most TIDSTORE_VALUE_NBITS), we don't encode tids; the block number is used
+ * directly as the key and the value is the bitmap of offset numbers.
+ */
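+/*
+ * As a worked example, assume 8kB heap blocks, so max_offset is
+ * MaxHeapTuplesPerPage (291) and offset_nbits is 9. The tid (block 1000,
+ * offset 5) is combined into the integer (1000 << 9) | 5 = 512005. The
+ * low 6 bits select bit 5 of the value (512005 & 63 = 5), and the
+ * remaining bits form the key (512005 >> 6 = 8000). The block number is
+ * recovered from the key as 8000 >> (9 - 6) = 1000.
+ */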
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The header object for a TidStore */
+typedef struct TidStoreControl
+{
+ int64 num_tids; /* the number of Tids stored so far */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ bool encode_tids; /* do we use tid encoding? */
+ int offset_nbits; /* the number of bits used for offset number */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used for the key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* have we returned all the tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends not only on the number of Tids stored but also
+ * on their distribution, on how the radix tree stores them, and on the memory
+ * management backing the radix tree. The maximum number of bytes that a
+ * TidStore may use is specified by max_bytes in tidstore_create(). We want
+ * the total memory consumption not to exceed max_bytes.
+ *
+ * In non-shared cases, the radix tree uses slab allocators for each kind of
+ * node class. The most memory-consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is allocating the largest
+ * radix tree node in a new slab block, which is approximately 70kB.
+ * Therefore, we deduct 70kB from the maximum bytes.
+ *
+ * In shared cases, DSA allocates memory segments following a geometric series
+ * that approximately doubles the total DSA size (see make_new_segment() in
+ * dsa.c). We simulated how DSA grows its segments; the simulation showed that
+ * a 75% threshold of the maximum bytes works well when max_bytes is a power
+ * of two, and that a 60% threshold works for other cases.
+ */
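+ /*
+ * For example, in the shared case a max_bytes of 256MB (a power of two)
+ * yields an effective limit of 192MB, while 200MB yields 120MB.
+ */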
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (1024 * 70);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ if (ts->control->offset_nbits > TIDSTORE_VALUE_NBITS)
+ {
+ ts->control->encode_tids = true;
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+ }
+ else
+ {
+ ts->control->encode_tids = false;
+ ts->control->offset_key_nbits = 0;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming error where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/* Forget all collected Tids */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+static inline void
+tidstore_insert_kv(TidStore *ts, uint64 key, uint64 val)
+{
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, val);
+ else
+ local_rt_set(ts->tree.local, key, val);
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ ItemPointerData tid;
+ uint64 key_base;
+ uint64 *values;
+ int nkeys;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+
+ if (ts->control->encode_tids)
+ {
+ key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+ }
+ else
+ {
+ key_base = (uint64) blkno;
+ nkeys = 1;
+ }
+
+ values = palloc0(sizeof(uint64) * nkeys);
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 key;
+ uint32 off;
+ int idx;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ /* encode the tid to key and val */
+ key = tid_to_key_off(ts, &tid, &off);
+
+ idx = key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ values[idx] |= UINT64CONST(1) << off;
+ }
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i < nkeys; i++)
+ {
+ if (values[i])
+ {
+ uint64 key = key_base + i;
+
+ tidstore_insert_kv(ts, key, values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. The caller must be certain that
+ * no other backend will attempt to update the TidStore during the iteration.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate over */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ else
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a TidStoreIterResult representing Tids
+ * in one page. Offset numbers in the result are sorted.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * Remember the key-value pair for the next block for the
+ * next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/* Finish an iteration over TidStore */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of Tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+ else
+ return sizeof(TidStore) + sizeof(TidStoreControl) +
+ local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract Tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if (i > iter->ts->control->max_offset)
+ break;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ if (ts->control->encode_tids)
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+ else
+ return (BlockNumber) key;
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+{
+ uint64 key;
+ uint64 tid_i;
+
+ if (!ts->control->encode_tids)
+ {
+ *off = ItemPointerGetOffsetNumber(tid);
+
+ /* Use the block number as the key */
+ return (int64) ItemPointerGetBlockNumber(tid);
+ }
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << ts->control->offset_nbits;
+
+ *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9b849ae8e8
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,195 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
v24-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From f4ae4a7c957b5e9351607ffbd85cd044ed09c339 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v24 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 0dca6bc5fa..80e91fac0f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
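+
+/*
+ * Illustrative example (not part of the patch): for word = 0b101100 (44),
+ * -word ends in ...010100 in two's complement, so word & -word = 0b000100,
+ * leaving only the least significant set bit.
+ */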
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 07fbb7ccf6..f4d1d60cd2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
Attachment: v24-0001-introduce-vector8_min-and-vector8_highbit_mask.patch (application/octet-stream)
From a42eb01c87675698ae5972421f8896f85f048f2b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v24 1/9] introduce vector8_min and vector8_highbit_mask
TODO: commit message
TODO: Remove uint64 case.
separate-commit TODO: move non-SIMD fallbacks to own header
to clean up the #ifdef maze.
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..f0bba33c53 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -77,6 +77,7 @@ static inline bool vector8_has(const Vector8 v, const uint8 c);
static inline bool vector8_has_zero(const Vector8 v);
static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
#endif
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -277,6 +279,36 @@ vector8_is_highbit_set(const Vector8 v)
#endif
}
+/*
+ * Return the bitmask of the high-bit of each element.
+ */
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#else
+ uint32 mask = 0;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ mask |= (((const uint8 *) &v)[i] >> 7) << i;
+
+ return mask;
+#endif
+}
+
/*
* Exactly like vector8_is_highbit_set except for the input type, so it
* looks at each byte separately.
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Compare the given vectors and return the vector of minimum elements.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
Attachment: v24-0003-Add-radixtree-template.patch (application/octet-stream)
From 3d16fe0d216f4efb093dd880da02a6e54651d109 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v24 3/9] Add radixtree template
The only things configurable in this commit are function scope,
prefix, and local/shared memory.
The key and value type are still hard-coded to uint64.
(A later commit in v21 will make value type configurable)
It might be good at some point to offer a different tree type,
e.g. "single-value leaves" to allow for variable length keys
and values, giving full flexibility to developers.
TODO: Much broader commit message
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2426 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 106 +
src/include/lib/radixtree_insert_impl.h | 317 +++
src/include/lib/radixtree_iter_impl.h | 138 +
src/include/lib/radixtree_search_impl.h | 122 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 673 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 3933 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..f591d903fc
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2426 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Implementation for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves".
+ *
+ * For simplicity, the key is assumed to be a 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined if RT_USE_DELETE is defined
+ *
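+ * A minimal usage sketch (illustrative only; the prefix "item_rt" and the
+ * uint64 value type are arbitrary choices, not part of this patch):
+ *
+ * #define RT_PREFIX item_rt
+ * #define RT_SCOPE static
+ * #define RT_DECLARE
+ * #define RT_DEFINE
+ * #define RT_VALUE_TYPE uint64
+ * #include "lib/radixtree.h"
+ *
+ * item_rt_radix_tree *tree = item_rt_create(CurrentMemoryContext);
+ * uint64 key = 123;
+ * uint64 value;
+ *
+ * item_rt_set(tree, key, 42);
+ * if (item_rt_search(tree, key, &value))
+ *     ... use value ...
+ * item_rt_free(tree);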
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *val_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE val);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Maximum number of levels in the radix tree */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256 is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
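+
+/*
+ * Illustrative example (not part of the patch): assuming
+ * SLAB_DEFAULT_BLOCK_SIZE is 8 kB and a hypothetical chunk size of 296 bytes,
+ * the first term rounds the default block size down to a multiple of the
+ * chunk size (27 * 296 = 7992), but the second term wins because 32 chunks
+ * need 9472 bytes, so the block size becomes 9472.
+ */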
+
+/* Common type for all node types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the full fanout of an 8-bit span.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store pointers
+ * to their child nodes in the slots. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base type for each node kind, for both leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses a slot_idxs array of RT_NODE_MAX_SLOTS length
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The slot index for each chunk; RT_INVALID_SLOT_IDX means unused */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* isset is a bitmap to track which slot is in use */
+ bitmapword isset[BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different from
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slot is in use.
+ */
+ bitmapword isset[BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to iterate at a time. Therefore,
+ * RT_NODE_ITER holds local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard that disallows other processes from beginning an
+ * iteration while one is in progress, or support for multiple concurrent iterations.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
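+
+/*
+ * A minimal iteration sketch (illustrative; assumes a tree instantiated with
+ * RT_PREFIX = item_rt and RT_VALUE_TYPE = uint64, names not from this patch):
+ *
+ *   item_rt_iter *iter = item_rt_begin_iterate(tree);
+ *   uint64  key;
+ *   uint64  value;
+ *
+ *   while (item_rt_iterate_next(iter, &key, &value))
+ *       ... keys and values arrive in ascending key order ...
+ *   item_rt_end_iterate(iter);
+ */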
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE value);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * <=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = BM_IDX(chunk);
+ int bitnum = BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that will allow storing the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
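+
+/*
+ * Worked example (illustrative, not part of the patch): for key 0x010203,
+ * pg_leftmost_one_pos64() returns 16, so RT_KEY_GET_SHIFT() returns 16 and
+ * the tree needs three levels (shifts 16, 8 and 0). RT_GET_KEY_CHUNK() then
+ * yields chunk 0x01 at shift 16, 0x02 at shift 8, and 0x03 at shift 0, and
+ * RT_SHIFT_GET_MAX_VAL(16) is 0xFFFFFF, the largest key such a tree can hold
+ * before RT_EXTEND() must add another level.
+ */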
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have the inner and leaf nodes for the given key.
+ * Insert inner nodes and a leaf node from 'node' down to the bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is set to child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * to the value is set to value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the child pointer and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the value and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE value)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to value. If the entry already exists, update its value to 'value'
+ * and return true; otherwise insert a new entry and return false.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE value)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value, parent, stored_child, child);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is stored in *value_p, which
+ * must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ return RT_NODE_SEARCH_LEAF(node, key, value_p);
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ return false;
+
+ /*
+ * Descend the tree to search for the key, building a stack of the nodes we
+ * visited along the way.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ return false;
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* the key was not found in the leaf node */
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys; in that case we don't need to
+ * delete the node.
+ */
+ if (node->count > 0)
+ return true;
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ return true;
+}
+#endif
+
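+/*
+ * Replace the chunk of the key under construction at the given shift,
+ * leaving the rest of the key intact.
+ */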
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value to *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/* Create and return the iterator for the given radix tree */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ /* empty tree */
+ if (!RT_PTR_ALLOC_IS_VALID(iter->tree->ctl->root))
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is
+ * constructed while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * If there is a next key, set *key_p and *value_p and return true.
+ * Otherwise return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!RT_PTR_ALLOC_IS_VALID(iter->tree->ctl->root))
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance the inner
+ * node iterators, starting from level 1, until we find the next
+ * child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = BM_IDX(slot);
+ int bitnum = BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+ /* We reached a leaf node, so find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+
+ fprintf(stderr, "%s",buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef BM_IDX
+#undef BM_BIT
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..99c90771b9
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,106 @@
+/* TODO: shrink nodes */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
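+ /*
+ * Delete the slot for 'chunk' from the node. Return false if the chunk is
+ * not present; otherwise remove it, decrement the node's count, and return
+ * true.
+ */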
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = BM_IDX(slotpos);
+ bitnum = BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..22aca0e6cc
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,317 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
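+ /*
+ * Insert 'chunk' into the node. Each case below first checks whether the
+ * chunk already exists and, if so, simply replaces the value (or child).
+ * If the node has no free slot, it is grown to a larger size class or to
+ * the next node kind; growing to a new kind falls through to that kind's
+ * case to perform the actual insertion.
+ */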
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[idx] = value;
+#else
+ n3->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = value;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = value;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = value;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = value;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, value);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or
+ * replaced properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..823d7107c4
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,138 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
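+ /*
+ * Advance the node iterator to the next used slot in this node. For an
+ * inner node the corresponding child is returned (NULL when no more slots
+ * remain); for a leaf node the value is copied to *value_p and true is
+ * returned (false when no more slots remain). In both cases the iterator's
+ * key is updated with the chunk of the slot found.
+ */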
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..c8410e9a5c
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,122 @@
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
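+ /*
+ * Look up 'chunk' in the node. For an inner node the child pointer is
+ * returned in *child_p, for a leaf node the value in *value_p; false is
+ * returned if the chunk is not present. When RT_ACTION_UPDATE is defined,
+ * the existing child pointer is instead replaced with 'new_child'.
+ */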
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..2a93e731ae
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,673 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (if you do
+ * that, you might want to increase the number of values used in each of
+ * the tests to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in an interleaved order: 1, children, 2, children - 1, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_set(radixtree, keys[i], (TestValueType) (keys[i] + 1)))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType) keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType) key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType) x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.31.1
Attachment: v24-0004-Tool-for-measuring-radix-tree-performance.patch
From aa1bb230f2760dbc9185b3237bbd4aba735b20c0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v24 4/9] Tool for measuring radix tree performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 ++
contrib/bench_radix_tree/bench_radix_tree.c | 656 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 822 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..4c785c7336
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,656 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ int key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
On Tue, Jan 31, 2023 at 9:43 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I've attached v24 patches. The locking support patch is separated
(0005 patch). Also I kept the updates for TidStore and the vacuum
integration from v23 separate.
Okay, that's a lot more simple, and closer to what I imagined. For v25, I
squashed v24's additions and added a couple of my own. I've kept the CF
status at "needs review" because no specific action is required at the
moment.
I did start to review the TID store some more, but that's on hold because
something else came up: On a lark I decided to re-run some benchmarks to
see if anything got lost in converting to a template, and that led me down
a rabbit hole -- some good and bad news on that below.
0001:
I removed the uint64 case, as discussed. There is now a brief commit
message, but it needs to be fleshed out a bit. I took another look at the Arm
optimization that Nathan found some months ago, for forming the highbit
mask, but that doesn't play nicely with how node32 uses it, so I decided
against it. I added a comment to describe the reasoning in case someone
else gets a similar idea.
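To make the node32 usage concrete, here is a rough sketch (my illustration
only, not the actual template code; it ignores the non-SIMD fallback, and
"chunks"/"count"/"chunk" stand in for the real node fields) of how
vector8_min() and vector8_highbit_mask() from 0001 combine with the existing
vector8_load/vector8_broadcast/vector8_eq helpers to find the first slot
whose chunk is >= the search key in a sorted 16-byte array:

	Vector8		haystack;
	Vector8		spread_chunk;
	Vector8		min_vec;
	Vector8		cmp;
	uint32		bitfield;
	int		index;

	/* load the 16 sorted chunk bytes and broadcast the search key */
	vector8_load(&haystack, &chunks[0]);
	spread_chunk = vector8_broadcast(chunk);

	/* for unsigned bytes, min(a, chunk) == chunk is the same as a >= chunk */
	min_vec = vector8_min(haystack, spread_chunk);
	cmp = vector8_eq(min_vec, spread_chunk);

	/* one mask bit per byte position whose comparison was true */
	bitfield = vector8_highbit_mask(cmp);

	/* the real code must also mask out slots beyond the node's count */
	if (bitfield)
		index = pg_rightmost_one_pos32(bitfield);	/* first chunk >= key */
	else
		index = -1;		/* no chunk is >= the search key */

The faster Arm variant mentioned in the 0001 comment returns a 64-bit mask in
which the bit position would have to be divided by 4, which is why it doesn't
drop into this one-bit-per-byte scheme cleanly.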
I briefly looked into "separate-commit TODO: move non-SIMD fallbacks to
their own header to clean up the #ifdef maze.", but decided it wasn't such
a clear win to justify starting the work now. It's still in the back of my
mind, but I removed the reminder from the commit message.
0003:
The template now requires the value to be passed as a pointer. That was a
pretty trivial change, but affected multiple other patches, so not sent
separately. Also adds a forgotten RT_ prefix to the bitmap macros and adds
a top comment to the *_impl.h headers. There are some comment fixes. The
changes were either trivial or discussed earlier, so also not sent
separately.
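Concretely, a caller that previously wrote the first line below now writes the
second (using the bench module's rt_ prefix; "key" and "val" are just
placeholders):

	rt_set(rt, key, val);	/* v24 and earlier: value passed by value */
	rt_set(rt, key, &val);	/* v25: value passed as a pointer */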
0004/5: I wanted to measure the load time as well as search time in
bench_search_random_nodes(). That's kept separate to make it easier to test
other patch versions.
The bad news is that the speed of loading TIDs in
bench_seq/shuffle_search() has regressed noticeably. I can't reproduce this
in any other bench function, which was the reason for writing 0005 to begin
with. More confusingly, my efforts to fix this improved *other* functions,
but the former didn't budge at all. First the patches:
0006 adds and removes some "inline" declarations (where it made sense), and
added some for "pg_noinline" based on Andres' advice some months ago.
0007 removes some dead code. RT_NODE_INSERT_INNER is only called during
RT_SET_EXTEND, so it can't possibly find an existing key. This kind of
change is much easier with the inner/node cases handled together in a
template, as far as being sure of how those cases are different. I thought
about trying the search in assert builds and verifying it doesn't exist,
but thought yet another #ifdef would be too messy.
v25-addendum-try-no-maintain-order.txt -- It makes keeping the key chunks
in order optional for the linear-search nodes. I believe the TID store no
longer cares about the ordering, but this is a text file for now because I
don't want to clutter the CI with a behavior change. Also, the second ART
paper (on concurrency) mentioned that some locking schemes don't allow
these arrays to be shifted. So it might make sense to give up entirely on
guaranteeing ordered iteration, or at least make it optional as in the
patch.
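To make the ordering question concrete, here is a minimal sketch of the two
insert strategies for a small node's chunk array (illustrative only, not the
template's code; "chunks" and "count" stand in for the node-3/32 fields, and
growing a full node is ignored):

#include <string.h>		/* for memmove */

static void
node_insert_ordered(uint8 *chunks, int *count, uint8 chunk)
{
	int		insertpos = 0;

	/* find the first slot whose chunk is >= the new one */
	while (insertpos < *count && chunks[insertpos] < chunk)
		insertpos++;

	/* shift the tail up one slot so the array stays sorted */
	memmove(&chunks[insertpos + 1], &chunks[insertpos], *count - insertpos);
	chunks[insertpos] = chunk;
	(*count)++;
}

static void
node_insert_unordered(uint8 *chunks, int *count, uint8 chunk)
{
	/* just append; iterating over the node no longer visits keys in order */
	chunks[(*count)++] = chunk;
}

The real nodes also have to shift the corresponding child or value slots when
keeping things ordered, which is where the extra cost during loading comes
from.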
Now for some numbers:
========================================
psql -c "select * from bench_search_random_nodes(10*1000*1000)"
(min load time of three)
v15:
mem_allocated | load_ms | search_ms
---------------+---------+-----------
334182184 | 3352 | 2073
v25-0005:
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 3426 | 2126
v25-0006 (inlining or not):
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 3327 | 2035
v25-0007 (remove dead code):
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 3313 | 2037
v25-addendum...txt (no ordering):
mem_allocated | load_ms | search_ms
---------------+---------+-----------
331987008 | 2762 | 2042
Allowing unordered inserts helps a lot here in loading. That's expected
because there are a lot of inserts into the linear nodes. 0006 might help a
little.
========================================
psql -c "select avg(load_ms) from generate_series(1,30) x(x), lateral
(select * from bench_load_random_int(500 * 1000 * (1+x-x))) a"
v15:
avg
----------------------
207.3000000000000000
v25-0005:
avg
----------------------
190.6000000000000000
v25-0006 (inlining or not):
avg
----------------------
189.3333333333333333
v25-0007 (remove dead code):
avg
----------------------
186.4666666666666667
v25-addendum...txt (no ordering):
avg
----------------------
179.7000000000000000
Most of the improvement from v15 to v25 probably comes from the change from
node4 to node3, and this test stresses that node the most. That shows in
the total memory used: it goes from 152MB to 132MB. Allowing unordered
inserts helps some, the others are not convincing.
========================================
psql -c "select rt_load_ms, rt_search_ms from bench_seq_search(0, 1 * 1000
* 1000)"
(min load time of three)
v15:
rt_load_ms | rt_search_ms
------------+--------------
113 | 455
v25-0005:
rt_load_ms | rt_search_ms
------------+--------------
135 | 456
v25-0006 (inlining or not):
rt_load_ms | rt_search_ms
------------+--------------
136 | 455
v25-0007 (remove dead code):
rt_load_ms | rt_search_ms
------------+--------------
135 | 455
v25-addendum...txt (no ordering):
rt_load_ms | rt_search_ms
------------+--------------
134 | 455
Note: The regression seems to have started in v17, which is the first with
a full template.
Nothing so far has helped here, and previous experience has shown that
trying to profile 100ms will not be useful. Instead of putting more effort
into diving deeper, it seems a better use of time to write a benchmark that
calls the tid store itself. That's more realistic, since this function was
intended to test load and search of tids, but the tid store doesn't quite
operate so simply anymore. What do you think, Masahiko?
I'm inclined to keep 0006, because it might give a slight boost, and 0007
because it's never a bad idea to remove dead code.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v25-addendum-try-no-maintain-order.txt
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 4e00b46d9b..3f831227c9 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -80,9 +80,10 @@
}
else
{
- int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int insertpos;// = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
int count = n3->base.n.count;
-
+#ifdef RT_MAINTAIN_ORDERING
+ insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
/* shift chunks and children */
if (insertpos < count)
{
@@ -95,6 +96,9 @@
count, insertpos);
#endif
}
+#else
+ insertpos = count;
+#endif /* order */
n3->base.chunks[insertpos] = chunk;
#ifdef RT_NODE_LEVEL_LEAF
@@ -186,8 +190,10 @@
}
else
{
- int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int insertpos;
int count = n32->base.n.count;
+#ifdef RT_MAINTAIN_ORDERING
+ insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
if (insertpos < count)
{
@@ -200,6 +206,9 @@
count, insertpos);
#endif
}
+#else
+ insertpos = count;
+#endif
n32->base.chunks[insertpos] = chunk;
#ifdef RT_NODE_LEVEL_LEAF
v25-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 86c2d232a0ea193a856cb0348e0825b5e4b7a4b7 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v25 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 3d2225e1ae..5f9a511b4a 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 07fbb7ccf6..f4d1d60cd2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.1
v25-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
From 949c6eef5ff7cc4f8ef2673f9aa63142a1d913ae Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v25 1/9] Introduce helper SIMD functions for small byte arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..350e2caaea 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+ * Note: There is a faster way to do this, but it returns a uint64, and
+ * if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.1
v25-0005-Measure-load-time-of-bench_search_random_nodes.patch
From 8edd5b4c0fcbf7681c5388faaf85a96ae451c99e Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 7 Feb 2023 13:06:00 +0700
Subject: [PATCH v25 5/9] Measure load time of bench_search_random_nodes
---
.../bench_radix_tree/bench_radix_tree--1.0.sql | 1 +
contrib/bench_radix_tree/bench_radix_tree.c | 17 ++++++++++++-----
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 2fd689aa91..95eedbbe10 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -47,6 +47,7 @@ create function bench_search_random_nodes(
cnt int8,
filter_str text DEFAULT NULL,
OUT mem_allocated int8,
+OUT load_ms int8,
OUT search_ms int8)
returns record
as 'MODULE_PATHNAME'
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 73ddee32de..7d1e2eee57 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -395,9 +395,10 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
end_time;
long secs;
int usecs;
+ int64 load_time_ms;
int64 search_time_ms;
- Datum values[2] = {0};
- bool nulls[2] = {0};
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
/* from trial and error */
uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
@@ -416,13 +417,18 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
{
- const uint64 hash = hash64(i);
- const uint64 key = hash & filter;
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
rt_set(rt, key, &key);
}
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
elog(NOTICE, "sleeping for 2 seconds...");
pg_usleep(2 * 1000000L);
@@ -449,7 +455,8 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
rt_stats(rt);
values[0] = Int64GetDatum(rt_memory_usage(rt));
- values[1] = Int64GetDatum(search_time_ms);
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
--
2.39.1
v25-0004-Tool-for-measuring-radix-tree-performance.patch
From 6fb21eb0b44b5923c0b736d82e86b1d4a40a71d6 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v25 4/9] Tool for measuring radix tree performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 ++
contrib/bench_radix_tree/bench_radix_tree.c | 656 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 822 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..73ddee32de
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,656 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.1
v25-0003-Add-radixtree-template.patch
From f421579a2e04baa04b258399e01f01485ce6f358 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v25 3/9] Add radixtree template
WIP: commit message based on template comments
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2516 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 332 +++
src/include/lib/radixtree_iter_impl.h | 153 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 674 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4086 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d6919aef08
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2516 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves"
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined only if RT_USE_DELETE is defined
+ *
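+ * For example, a minimal local (non-shared) instantiation mapping uint64
+ * keys to uint64 values could look like the following sketch (the prefix
+ * "rt" and the value type are arbitrary choices for illustration; any
+ * RT_PREFIX and RT_VALUE_TYPE work the same way):
+ *
+ *   #define RT_PREFIX rt
+ *   #define RT_SCOPE static
+ *   #define RT_DECLARE
+ *   #define RT_DEFINE
+ *   #define RT_VALUE_TYPE uint64
+ *   #include "lib/radixtree.h"
+ *
+ *   rt_radix_tree *tree;
+ *   uint64 key = 1;
+ *   uint64 val = 42;
+ *
+ *   tree = rt_create(CurrentMemoryContext);
+ *   rt_set(tree, key, &val);
+ *   if (rt_search(tree, key, &val))
+ *       ...
+ *   rt_free(tree);
+ *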
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256 is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
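+ *
+ * For example, assuming the default 8kB slab block size, a chunk size of
+ * 296 bytes gives (8192 / 296) * 296 = 7992 bytes, which is less than
+ * 296 * 32 = 9472, so the block size is rounded up to 9472 bytes.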
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store pointers
+ * to child nodes in their slots. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base types for each node kind, for leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses a slot_idxs array of RT_NODE_MAX_SLOTS length
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* Index into the children/values array, for each possible chunk value */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different than
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
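+ *
+ * For instance, RT_CLASS_32_MIN (fanout 15) and RT_CLASS_32_MAX (fanout 32)
+ * below share the node-32 base type, so growing from the former to the
+ * latter only requires allocating the larger node and copying the contents,
+ * with no change to the search or insert logic.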
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning an
+ * iteration while one is in progress, or support for multiple concurrent iterations.
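+ *
+ * A minimal usage sketch, assuming an instantiation like the hypothetical
+ * rt_* example in the file header comment:
+ *
+ *   rt_iter *iter = rt_begin_iterate(tree);
+ *   uint64 key;
+ *   uint64 val;
+ *
+ *   while (rt_iterate_next(iter, &key, &val))
+ *       ... process (key, val) in ascending key order ...
+ *   rt_end_iterate(iter);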
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both the chunks array and the children/values arrays.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that allows storing the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
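+ *
+ * For example, RT_KEY_GET_SHIFT(0x1234) is 8 (the highest set bit is in the
+ * second-lowest byte), and a tree whose root has shift 8 can store any key
+ * up to RT_SHIFT_GET_MAX_VAL(8) = 0xFFFF.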
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+ /* these new upper nodes are inner nodes, not leaves */
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' to bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is returned in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is returned in *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the child entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the value entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+	/* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
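+ *
+ * Children are freed before their parent (a depth-first traversal). This is
+ * only needed for shared trees; for local trees, RT_FREE just deletes the
+ * slab contexts that hold the nodes.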
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to the value pointed to by 'value_p'. If the entry already exists,
+ * update its value and return true; otherwise insert a new entry and return
+ * false.
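+ *
+ * A minimal usage sketch (hypothetical names; assumes a local tree created
+ * with RT_CREATE, the "rt_" prefix used by the test module, and a uint64
+ * value type):
+ *
+ *     uint64  val = 42;
+ *
+ *     if (!rt_set(tree, key, &val))
+ *         elog(DEBUG1, "inserted key " UINT64_FORMAT, key);
+ *     else
+ *         elog(DEBUG1, "updated key " UINT64_FORMAT, key);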
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+		RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is copied into *value_p, so
+ * value_p must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+	/* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
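+	 *
+	 * The stack is used after the leaf entry is deleted: if a node becomes
+	 * empty, we walk back up the stack, delete the corresponding entry from
+	 * its parent, and free the emptied node.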
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+		/* the key was not found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+	/* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+		/* If the node didn't become empty, we can stop deleting entries */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the next child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and copy the
+ * value into *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Initialize the node iterators in the stack from 'from_node' down to the
+ * leaf, advancing each inner node to its first child.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
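+ *
+ * A minimal iteration sketch (hypothetical names; assumes the "rt_" prefix
+ * used by the test module and a uint64 value type):
+ *
+ *     rt_iter    *iter = rt_begin_iterate(tree);
+ *     uint64      key;
+ *     uint64      val;
+ *
+ *     while (rt_iterate_next(iter, &key, &val))
+ *         elog(DEBUG1, "key " UINT64_FORMAT " value " UINT64_FORMAT, key, val);
+ *     rt_end_iterate(iter);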
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+	/* empty tree */
+	if (!iter->tree->ctl->root)
+	{
+		MemoryContextSwitchTo(old_ctx);
+		return iter;
+	}
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend from the root to the leftmost leaf node. The key is constructed
+	 * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key; otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+		/* Advance the leaf node iterator to get the next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner node
+		 * iterators, starting at level 1, until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+		 * Found the next child node. Update the iterator stack from this node
+		 * down to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function needs to be called when the iteration is finished, or when
+ * bailing out of it early.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
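+ *
+ * For a shared tree this is the total size of the backing DSA area, so it may
+ * overstate the tree's own usage if other data lives in the same area.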
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the invariants of the given radix tree node (a no-op unless
+ * assertions are enabled).
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+					/* Check if the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+			/* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+	fprintf(stderr, "%s", buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
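+ *
+ * This only removes the entry for the key's chunk and decrements the node's
+ * count; freeing a node that becomes empty is the caller's responsibility.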
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..c18e26b537
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,332 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
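+ *
+ * If the target node is full, it is first grown, either to the larger size
+ * class of the same kind or to the next node kind; in the latter case control
+ * falls through to the switch case for the new kind, which then performs the
+ * actual insertion.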
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[idx] = *value_p;
+#else
+ n3->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = *value_p;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+	 * Done. Finally, verify that the chunk and value were inserted or replaced
+	 * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..98c78eb237
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,153 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
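+ *
+ * For inner nodes this returns the next child node, or NULL when the node is
+ * exhausted; for leaf nodes it returns whether a next value was found and
+ * stores that value in *value_p. In both cases the iterator's key is updated
+ * with the chunk just visited.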
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
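+ *
+ * When RT_ACTION_UPDATE is defined (inner nodes only), the chunk is assumed
+ * to exist and its child pointer is overwritten with 'new_child' rather than
+ * returned.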
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/include/lib/radixtree.h"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+    '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..f944945db9
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,674 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+		elog(ERROR, "rt_num_entries on empty tree returned non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+		elog(ERROR, "rt_iterate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+	/* prepare keys out of order, like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+			elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+			elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType*) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.1
Attachment: v25-0006-Adjust-some-inlining-declarations.patch (text/x-patch)
From 77541d3f48e6fef39645df5b3c535ac431e12194 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 6 Feb 2023 21:04:14 +0700
Subject: [PATCH v25 6/9] Adjust some inlining declarations
---
src/include/lib/radixtree.h | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d6919aef08..4bd0aaa810 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1124,7 +1124,7 @@ RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_le
* Create a new node as the root. Subordinate nodes will be created during
* the insertion.
*/
-static void
+static pg_noinline void
RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
{
int shift = RT_KEY_GET_SHIFT(key);
@@ -1215,7 +1215,7 @@ RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
/*
* Replace old_child with new_child, and free the old one.
*/
-static void
+static inline void
RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
RT_PTR_ALLOC new_child, uint64 key)
@@ -1242,7 +1242,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
* The radix tree doesn't have sufficient height. Extend the radix tree so
* it can store the key.
*/
-static void
+static pg_noinline void
RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
@@ -1281,7 +1281,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* The radix tree doesn't have inner and leaf nodes for given key-value pair.
* Insert inner and leaf nodes from 'node' to bottom.
*/
-static inline void
+static pg_noinline void
RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
@@ -1486,7 +1486,7 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
/*
* Recursively free all nodes allocated to the DSA area.
*/
-static inline void
+static void
RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
{
RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
--
2.39.1
Attachment: v25-0007-Skip-unnecessary-searches-in-RT_NODE_INSERT_INNE.patch (text/x-patch)
From f2a3340200ea26c17de5c5261adbeaada64ae4b6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 6 Feb 2023 22:04:50 +0700
Subject: [PATCH v25 7/9] Skip unnecessary searches in RT_NODE_INSERT_INNER
For inner nodes, we know the key chunk doesn't exist already,
otherwise we would have found it while descending the tree.
To reinforce this fact, declare this function to return void.
---
src/include/lib/radixtree.h | 4 +--
src/include/lib/radixtree_insert_impl.h | 48 ++++++++++++-------------
2 files changed, 24 insertions(+), 28 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 4bd0aaa810..1cdb995e54 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -685,7 +685,7 @@ typedef struct RT_ITER
} RT_ITER;
-static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child);
static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_VALUE_TYPE *value_p);
@@ -1375,7 +1375,7 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
* If the node we're inserting into needs to grow, we update the parent's
* child pointer with the pointer to the new larger node.
*/
-static bool
+static void
RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child)
{
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index c18e26b537..d56e58dcac 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -28,10 +28,10 @@
#endif
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool chunk_exists = false;
#ifdef RT_NODE_LEVEL_LEAF
const bool is_leaf = true;
+ bool chunk_exists = false;
Assert(RT_NODE_IS_LEAF(node));
#else
const bool is_leaf = false;
@@ -43,21 +43,18 @@
case RT_NODE_KIND_3:
{
RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
- int idx;
- idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
if (idx != -1)
{
/* found the existing chunk */
chunk_exists = true;
-#ifdef RT_NODE_LEVEL_LEAF
n3->values[idx] = *value_p;
-#else
- n3->children[idx] = child;
-#endif
break;
}
-
+#endif
if (unlikely(RT_NODE_MUST_GROW(n3)))
{
RT_PTR_ALLOC allocnode;
@@ -113,21 +110,18 @@
{
const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
- int idx;
- idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
if (idx != -1)
{
/* found the existing chunk */
chunk_exists = true;
-#ifdef RT_NODE_LEVEL_LEAF
n32->values[idx] = *value_p;
-#else
- n32->children[idx] = child;
-#endif
break;
}
-
+#endif
if (unlikely(RT_NODE_MUST_GROW(n32)) &&
n32->base.n.fanout < class32_max.fanout)
{
@@ -220,21 +214,19 @@
case RT_NODE_KIND_125:
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
- int slotpos = n125->base.slot_idxs[chunk];
+ int slotpos;
int cnt = 0;
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
if (slotpos != RT_INVALID_SLOT_IDX)
{
/* found the existing chunk */
chunk_exists = true;
-#ifdef RT_NODE_LEVEL_LEAF
n125->values[slotpos] = *value_p;
-#else
- n125->children[slotpos] = child;
-#endif
break;
}
-
+#endif
if (unlikely(RT_NODE_MUST_GROW(n125)))
{
RT_PTR_ALLOC allocnode;
@@ -300,14 +292,10 @@
#ifdef RT_NODE_LEVEL_LEAF
chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
-#else
- chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
-#endif
Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
-
-#ifdef RT_NODE_LEVEL_LEAF
RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
RT_NODE_INNER_256_SET(n256, chunk, child);
#endif
break;
@@ -315,8 +303,12 @@
}
/* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
if (!chunk_exists)
node->count++;
+#else
+ node->count++;
+#endif
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -324,7 +316,11 @@
*/
RT_VERIFY_NODE(node);
+#ifdef RT_NODE_LEVEL_LEAF
return chunk_exists;
+#else
+ return;
+#endif
#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
--
2.39.1
Attachment: v25-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (text/x-patch)
From 693e335f77211e9947cd356d9287c9af96e78815 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v25 8/9] Add TIDStore, to store sets of TIDs (ItemPointerData)
efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and a
64-bit value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 688 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 195 +++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1033 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1756f1a4b6..d936aa3da3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2192,6 +2192,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..4c72673ce9
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,688 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of a 64-bit key and a 64-bit value,
+ * and stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * with tidstore_attach().
+ *
+ * Regarding concurrency, this module mostly relies on the concurrency support
+ * in the radix tree, but we acquire a lock on the TidStore in some cases, for
+ * example, when resetting the store and when accessing the number of tids in
+ * the store (num_tids).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of a 64-bit key and
+ * a 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is derived from max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys help keep the radix tree
+ * shallow.
+ *
+ * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ *
+ * If the bitmap of all possible offset numbers fits in a single 64-bit value,
+ * we don't encode tids but directly use the block number and the offset
+ * number as key and value, respectively.
+ */
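+
+/*
+ * A worked example (illustrative, assuming 8kB heap blocks so offset_nbits
+ * is 9): for blkno = 1000 and offset = 7, the combined integer is
+ * (1000 << 9) | 7 = 512007.  The low 6 bits (7) select the bit to set in
+ * the 64-bit value, and the key is 512007 >> 6 = 8000.
+ */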
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for max_offset */
+ bool encode_tids; /* do we use tid encoding? */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used for the key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* have we returned all the tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends on the number of stored tids, but also on their
+ * distribution, on how the radix tree stores them, and on the memory
+ * management backing the radix tree. The maximum bytes that a TidStore can
+ * use is specified by max_bytes in tidstore_create(). We want the total
+ * amount of memory consumed by a TidStore not to exceed max_bytes.
+ *
+ * In the local TidStore case, the radix tree uses a slab allocator for each
+ * node class. The most memory-consuming step while adding tids associated
+ * with one page (i.e. during tidstore_add_tids()) is allocating a new
+ * slab block for a new radix tree node, which is approximately 70kB.
+ * Therefore, we deduct 70kB from max_bytes.
+ *
+ * In the shared case, DSA allocates memory segments following a geometric
+ * series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases segment
+ * size, and the simulation showed that a 75% threshold for the maximum bytes
+ * works well when max_bytes is a power of two, and a 60% threshold works
+ * for other cases.
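+ *
+ * For example (illustrative numbers): with a power-of-two max_bytes of 256MB,
+ * the shared-case limit becomes 192MB, while a 100MB local TidStore ends up
+ * capped at 100MB minus 70kB.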
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ /*
+ * We use tid encoding if the bitmap of all possible offset numbers doesn't
+ * fit in a single uint64 value.
+ */
+ if (ts->control->offset_nbits > TIDSTORE_VALUE_NBITS)
+ {
+ ts->control->encode_tids = true;
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+ }
+ else
+ {
+ ts->control->encode_tids = false;
+ ts->control->offset_key_nbits = 0;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming error where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected Tids. It's similar to tidstore_destroy, but instead of
+ * freeing the entire TidStore, we recreate only the radix tree storage.
+ */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ ItemPointerData tid;
+ uint64 key_base;
+ uint64 *values;
+ int nkeys;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (ts->control->encode_tids)
+ {
+ key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+ }
+ else
+ {
+ key_base = (uint64) blkno;
+ nkeys = 1;
+ }
+ values = palloc0(sizeof(uint64) * nkeys);
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 key;
+ uint32 off;
+ int idx;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ /* encode the tid to key and val */
+ key = tid_to_key_off(ts, &tid, &off);
+
+ idx = key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ values[idx] |= UINT64CONST(1) << off;
+ }
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i < nkeys; i++)
+ {
+ if (values[i])
+ {
+ uint64 key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &values[i]);
+ else
+ local_rt_set(ts->tree.local, key, &values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value pair into the radix tree.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a pointer to a TidStoreIterResult that has the
+ * tids in one block. We return the block numbers in ascending order, and the
+ * offset numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ /* Process the previously collected key-value */
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/*
+ * Finish an iteration over a TidStore. This needs to be called after finishing
+ * an iteration, or when exiting one early.
+ */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if (i > iter->ts->control->max_offset)
+ {
+ Assert(!iter->ts->control->encode_tids);
+ break;
+ }
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ if (ts->control->encode_tids)
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+
+ return (BlockNumber) key;
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+{
+ uint64 key;
+ uint64 tid_i;
+
+ if (!ts->control->encode_tids)
+ {
+ *off = ItemPointerGetOffsetNumber(tid);
+
+ /* Use the block number as the key */
+ return (uint64) ItemPointerGetBlockNumber(tid);
+ }
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << ts->control->offset_nbits;
+
+ *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9b849ae8e8
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,195 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.1
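
For reference, here is a minimal sketch (not part of the patches) of how the backend-local TidStore API introduced above could be used. The function name and the constants are illustrative only; the real usage is in the test module above and in the lazy vacuum patch below:

    /* Illustrative sketch of the TidStore API; not part of the patch set. */
    #include "postgres.h"
    #include "access/htup_details.h"
    #include "access/tidstore.h"

    static void
    tidstore_usage_sketch(void)
    {
        TidStore   *ts;
        TidStoreIter *iter;
        TidStoreIterResult *result;
        OffsetNumber offs[] = {1, 2, 5};
        ItemPointerData tid;

        /* Backend-local store: pass NULL instead of a DSA area. */
        ts = tidstore_create(64 * 1024 * 1024, MaxHeapTuplesPerPage, NULL);

        /* Record three dead offsets on block 42. */
        tidstore_add_tids(ts, 42, offs, lengthof(offs));

        /* Point lookup, as lazy_tid_reaped() would do for each index tuple. */
        ItemPointerSet(&tid, 42, 2);
        Assert(tidstore_lookup_tid(ts, &tid));

        /* Iterate block by block; offsets come back sorted per block. */
        iter = tidstore_begin_iterate(ts);
        while ((result = tidstore_iterate_next(iter)) != NULL)
            elog(DEBUG1, "block %u has %d dead offsets",
                 result->blkno, result->num_offsets);
        tidstore_end_iterate(iter);

        tidstore_destroy(ts);
    }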
Attachment: v25-0009-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch (text/x-patch)
From dcbcf6cdd786f9debf1536ac73093107debfafe8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 17 Jan 2023 17:20:37 +0700
Subject: [PATCH v25 9/9] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space efficient and slow to lookup. Also, we had
the 1GB limit on its size.
Now we use TIDStore to store dead tuple TIDs. Since the TIDStore,
backed by the radix tree, incrementally allocates memory, we get
rid of the 1GB limit.
Since we are no longer able to exactly estimate the maximum number of
TIDs that can be stored, pg_stat_progress_vacuum now shows the progress
information based on the amount of memory in bytes. The column names
are also changed to max_dead_tuple_bytes and num_dead_tuple_bytes.
In addition, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by the TIDStore is 1MB, the initial DSA
segment size. Due to that, we increase the minimum value of
maintenance_work_mem (also autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 278 ++++++++-------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-----
src/backend/commands/vacuumparallel.c | 73 +++---
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 177 insertions(+), 314 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d936aa3da3..0230c74e3d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6870,10 +6870,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6881,10 +6881,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..b4e40423a8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3,18 +3,18 @@
* vacuumlazy.c
* Concurrent ("lazy") vacuuming.
*
- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is the TidStore, which stores the dead TIDs
* that are to be removed from indexes. We want to ensure we can vacuum even
* the very largest relations with finite memory space usage. To do that, we
- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables). If the array threatens to overflow, we must call
- * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
- * This frees up the memory space dedicated to storing dead TIDs.
+ * create a TidStore with that value as the upper limit of its memory usage.
+ * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
+ * vacuum the pages that we've pruned). This frees up the memory space dedicated
+ * to storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,11 +221,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -259,8 +263,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -487,11 +492,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/*
- * Allocate dead_items array memory using dead_items_alloc. This handles
- * parallel VACUUM initialization as part of allocating shared memory
- * space used for dead_items. (But do a failsafe precheck first, to
- * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
- * is already dangerously old.)
+ * Allocate dead_items memory using dead_items_alloc. This handles parallel
+ * VACUUM initialization as part of allocating shared memory space used for
+ * dead_items. (But do a failsafe precheck first, to ensure that parallel
+ * VACUUM won't be attempted at all when relfrozenxid is already dangerously
+ * old.)
*/
lazy_check_wraparound_failsafe(vacrel);
dead_items_alloc(vacrel, params->nworkers);
@@ -797,7 +802,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* have collected the TIDs whose index tuples need to be removed.
*
* Finally, invokes lazy_vacuum_heap_rel to vacuum heap pages, which
- * largely consists of marking LP_DEAD items (from collected TID array)
+ * largely consists of marking LP_DEAD items (from vacrel->dead_items)
* as LP_UNUSED. This has to happen in a second, final pass over the
* heap, to preserve a basic invariant that all index AMs rely on: no
* extant index tuple can ever be allowed to contain a TID that points to
@@ -825,21 +830,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +911,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -969,7 +973,7 @@ lazy_scan_heap(LVRelState *vacrel)
continue;
}
- /* Collect LP_DEAD items in dead_items array, count tuples */
+ /* Collect LP_DEAD items in dead_items, count tuples */
if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
&recordfreespace))
{
@@ -1011,14 +1015,14 @@ lazy_scan_heap(LVRelState *vacrel)
* Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
- * dead_items array. This includes LP_DEAD line pointers that we
- * pruned ourselves, as well as existing LP_DEAD line pointers that
- * were pruned some time earlier. Also considers freezing XIDs in the
- * tuple headers of remaining items with storage.
+ * dead_items. This includes LP_DEAD line pointers that we pruned
+ * ourselves, as well as existing LP_DEAD line pointers that were pruned
+ * some time earlier. Also considers freezing XIDs in the tuple headers
+ * of remaining items with storage.
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1038,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1080,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page in dead_items */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1156,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1204,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1260,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1524,9 +1535,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
* The approach we take now is to restart pruning when the race condition is
* detected. This allows heap_page_prune() to prune the tuples inserted by
* the now-aborted transaction. This is a little crude, but it guarantees
- * that any items that make it into the dead_items array are simple LP_DEAD
- * line pointers, and that every remaining item with tuple storage is
- * considered as a candidate for freezing.
+ * that any items that make it into the dead_items are simple LP_DEAD line
+ * pointers, and that every remaining item with tuple storage is considered
+ * as a candidate for freezing.
*/
static void
lazy_scan_prune(LVRelState *vacrel,
@@ -1543,13 +1554,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1580,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1588,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->deadoffsets; prunestate->deadoffsets's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1601,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1646,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1875,7 +1883,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1896,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1917,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -1940,7 +1929,7 @@ retry:
* lazy_scan_prune, which requires a full cleanup lock. While pruning isn't
* performed here, it's quite possible that an earlier opportunistic pruning
* operation left LP_DEAD items behind. We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items for removal from indexes.
*
* For aggressive VACUUM callers, we may return false to indicate that a full
* cleanup lock is required for processing by lazy_scan_prune. This is only
@@ -2099,7 +2088,7 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
vacrel->NewRelminMxid = NoFreezePageRelminMxid;
- /* Save any LP_DEAD items found on the page in dead_items array */
+ /* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
@@ -2129,8 +2118,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2127,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2179,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2208,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2235,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2281,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2354,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2392,9 +2373,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
/*
* lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
*
- * This routine marks LP_DEAD items in vacrel->dead_items array as LP_UNUSED.
- * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
- * at all.
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
*
* We may also be able to truncate the line pointer array of the heap pages we
* visit. If there is a contiguous group of LP_UNUSED items at the end of the
@@ -2410,10 +2390,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2409,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2419,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2433,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2444,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,36 +2454,31 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2497,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2571,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -2687,8 +2660,8 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
* lazy_vacuum_one_index() -- vacuum index relation.
*
* Delete all the index tuples containing a TID collected in
- * vacrel->dead_items array. Also update running statistics.
- * Exact details depend on index AM's ambulkdelete routine.
+ * vacrel->dead_items. Also update running statistics. Exact
+ * details depend on index AM's ambulkdelete routine.
*
* reltuples is the number of heap tuples to be passed to the
* bulkdelete callback. It's always assumed to be estimated.
@@ -3094,48 +3067,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
}
/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
-/*
- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
@@ -3143,11 +3076,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3105,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3118,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..a526e607fe 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 7b1a4b127e..d8e680ca20 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,82 +2342,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch((void *) itemptr,
- (void *) dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..d653683693 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,12 +9,11 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSM segment. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * the shared TidStore. We launch parallel worker processes at the start of
+ * parallel index bulk-deletion and index cleanup and once all indexes are
+ * processed, the parallel worker processes exit. Each time we process indexes
+ * in parallel, the parallel context is re-initialized so that the same DSM can
+ * be used for multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -103,6 +102,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +168,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +225,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +289,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +356,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +375,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +384,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +441,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +452,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +950,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +996,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1045,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index f5ea381c53..d88db3e1f8 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3397,12 +3397,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b46e3b8c55..27a88b9369 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2312,7 +2312,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..a3ebb169ef 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e7a2f5856a..f6ae02eb14 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.39.1
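As a quick aside (not part of the patch set): with 0009 applied, the byte-based
progress counters can be watched with a query along these lines, using the
renamed columns from the updated pg_stat_progress_vacuum view above:
-- assumes the v25-0009 patch is applied, so the byte-based columns exist
SELECT pid, phase, index_vacuum_count,
       max_dead_tuple_bytes, dead_tuple_bytes
FROM pg_stat_progress_vacuum;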
On Tue, Feb 7, 2023 at 4:25 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
[v25]
This conflicted with a commit from earlier today, so rebased in v26 with no
further changes.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
Attachment: v25-addendum-try-no-maintain-order.txt (text/plain; charset=US-ASCII)
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index 4e00b46d9b..3f831227c9 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -80,9 +80,10 @@
}
else
{
- int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int insertpos;
int count = n3->base.n.count;
-
+#ifdef RT_MAINTAIN_ORDERING
+ insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
/* shift chunks and children */
if (insertpos < count)
{
@@ -95,6 +96,9 @@
count, insertpos);
#endif
}
+#else
+ insertpos = count;
+#endif /* order */
n3->base.chunks[insertpos] = chunk;
#ifdef RT_NODE_LEVEL_LEAF
@@ -186,8 +190,10 @@
}
else
{
- int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int insertpos;
int count = n32->base.n.count;
+#ifdef RT_MAINTAIN_ORDERING
+ insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
if (insertpos < count)
{
@@ -200,6 +206,9 @@
count, insertpos);
#endif
}
+#else
+ insertpos = count;
+#endif
n32->base.chunks[insertpos] = chunk;
#ifdef RT_NODE_LEVEL_LEAF
Attachment: v26-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch (text/x-patch; charset=US-ASCII)
From f6f476ba71864821cb5144f513165671c64db1b2 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v26 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 3d2225e1ae..5f9a511b4a 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 07fbb7ccf6..f4d1d60cd2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3662,7 +3662,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.1
Attachment: v26-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch (text/x-patch; charset=US-ASCII)
From cf3e16ed894fc0c6574c48eddad7c587e5dec688 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v26 1/9] Introduce helper SIMD functions for small byte arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..350e2caaea 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+ * Note: There is a faster way to do this, but it returns a uint64 and
+ * and if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.1
Attachment: v26-0005-Measure-load-time-of-bench_search_random_nodes.patch (text/x-patch; charset=US-ASCII)
From 4a0d293937876ec348f30ce4ab94da14b925b020 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 7 Feb 2023 13:06:00 +0700
Subject: [PATCH v26 5/9] Measure load time of bench_search_random_nodes
---
.../bench_radix_tree/bench_radix_tree--1.0.sql | 1 +
contrib/bench_radix_tree/bench_radix_tree.c | 17 ++++++++++++-----
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 2fd689aa91..95eedbbe10 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -47,6 +47,7 @@ create function bench_search_random_nodes(
cnt int8,
filter_str text DEFAULT NULL,
OUT mem_allocated int8,
+OUT load_ms int8,
OUT search_ms int8)
returns record
as 'MODULE_PATHNAME'
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 73ddee32de..7d1e2eee57 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -395,9 +395,10 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
end_time;
long secs;
int usecs;
+ int64 load_time_ms;
int64 search_time_ms;
- Datum values[2] = {0};
- bool nulls[2] = {0};
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
/* from trial and error */
uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
@@ -416,13 +417,18 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
for (uint64 i = 0; i < cnt; i++)
{
- const uint64 hash = hash64(i);
- const uint64 key = hash & filter;
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
rt_set(rt, key, &key);
}
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
elog(NOTICE, "sleeping for 2 seconds...");
pg_usleep(2 * 1000000L);
@@ -449,7 +455,8 @@ bench_search_random_nodes(PG_FUNCTION_ARGS)
rt_stats(rt);
values[0] = Int64GetDatum(rt_memory_usage(rt));
- values[1] = Int64GetDatum(search_time_ms);
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
--
2.39.1
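The new load_ms column is computed with the same elapsed-time idiom used
throughout the benchmark module. For reference, a stripped-down sketch of
that idiom (illustration only; elapsed_ms() and run_workload() are
hypothetical names, not part of the patch):

#include "postgres.h"
#include "utils/timestamp.h"

/* Illustration only: the elapsed-millisecond idiom used by the benchmarks. */
static int64
elapsed_ms(void (*run_workload) (void))
{
	TimestampTz start_time = GetCurrentTimestamp();
	TimestampTz end_time;
	long		secs;
	int			usecs;

	run_workload();

	end_time = GetCurrentTimestamp();
	TimestampDifference(start_time, end_time, &secs, &usecs);

	return (int64) secs * 1000 + usecs / 1000;
}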
Attachment: v26-0004-Tool-for-measuring-radix-tree-performance.patch (text/x-patch)
From 0e328f6d85d30797af158f2a4070004fe40d93fe Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v26 4/9] Tool for measuring radix tree performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 76 ++
contrib/bench_radix_tree/bench_radix_tree.c | 656 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 822 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..2fd689aa91
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,76 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..73ddee32de
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,656 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 search_time_ms;
+ Datum values[2] = {0};
+ bool nulls[2] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ const uint64 hash = hash64(i);
+ const uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+ rt_stats(rt);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+ rt_stats(rt);
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.1
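To make the benchmark's key encoding concrete: tid_to_key_off() packs a TID as
(block << ceil_log2(MaxHeapTuplesPerPage)) | offset, then uses everything
above the low 6 bits as the radix tree key and the low 6 bits as a bit
position within the 64-bit value stored under that key. With 8kB pages the
shift is 9, so block 10 / offset 3 encodes to 5123, i.e. key 80 with bit 3
set. A hypothetical inverse, for illustration only (key_off_to_tid() is not
part of the patch), would look like this:

#include "postgres.h"
#include "access/htup_details.h"
#include "port/pg_bitutils.h"
#include "storage/itemptr.h"

/* Illustration only: rebuild a TID from a radix tree key and bit position. */
static void
key_off_to_tid(uint64 key, uint32 off, ItemPointer tid)
{
	uint32		shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
	uint64		tid_i = (key << 6) | off;

	Assert(off < sizeof(uint64) * BITS_PER_BYTE);

	ItemPointerSet(tid,
				   (BlockNumber) (tid_i >> shift),
				   (OffsetNumber) (tid_i & ((1 << shift) - 1)));
}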
Attachment: v26-0003-Add-radixtree-template.patch (text/x-patch)
From 4c4cbb9b13da160b8883e6c7f861516f3eedac6a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v26 3/9] Add radixtree template
WIP: commit message based on template comments
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2516 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 332 +++
src/include/lib/radixtree_iter_impl.h | 153 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 674 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4086 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..d6919aef08
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2516 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves"
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITER - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The number of maximum slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256, is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store a
+ * pointer to the child node in the slot. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base type of each node kind, for leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different than
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes of each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes to begin the iteration
+ * while one process is doing it, or to allow multiple processes to do the iteration.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that allows storing the given key.
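+ * For example, with RT_NODE_SPAN = 8, key 0x10000 has its highest set bit at
+ * position 16, so the returned shift is 16.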
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
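+ * For example, with RT_NODE_SPAN = 8, a tree whose root has shift 8 can store
+ * keys up to 0xFFFF.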
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
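+ /* The old root becomes the only child of the new node, at chunk 0 */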
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have the inner and leaf nodes for the given
+ * key-value pair. Insert the inner and leaf nodes from 'node' down to the bottom.
+ */
+static inline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is copied into *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * delete it if found.
+ *
+ * Return true if the entry was found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and delete
+ * it if found.
+ *
+ * Return true if the entry was found and deleted, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static bool
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static inline void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set the value for the given key. If the entry already exists, update its
+ * value and return true. Otherwise insert a new entry and return false.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is copied into *value_p, so
+ * value_p must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree searching for the key, while building a stack of the
+ * nodes we visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * If the leaf node still has keys, we don't need to delete the node, so
+ * we're done.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the key from the inner nodes, walking back up the stack of visited nodes */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
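+/* Replace the chunk of the iterator's key at the given shift position */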
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value to *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
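+ *
+ * Starting from 'from_node' at stack level 'from', descend to the leftmost
+ * leaf, initializing the node iterator at each level along the way.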
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* We must be able to find the first child in the inner node */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+ /* empty tree */
+ if (!RT_PTR_ALLOC_IS_VALID(iter->tree->ctl->root))
+ {
+ MemoryContextSwitchTo(old_ctx);
+ return iter;
+ }
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is constructed
+ * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key. Otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!RT_PTR_ALLOC_IS_VALID(iter->tree->ctl->root))
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance the inner node
+ * iterators, starting from level 1, until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function needs to be called when the iteration is finished, or when
+ * bailing out of it early, so that the lock is released.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check if the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s",buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
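+ /* The slot's array entry itself is not cleared; clearing the isset bit frees it for reuse */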
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..c18e26b537
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,332 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+ bool chunk_exists = false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[idx] = *value_p;
+#else
+ n3->children[idx] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
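+ /* The node has grown into a node-32; fall through to the node-32 case to insert into it */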
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children/values to make room at insertpos */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx;
+
+ idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[idx] = *value_p;
+#else
+ n32->children[idx] = child;
+#endif
+ break;
+ }
+
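+ /* If the node is full but still in the smaller size class, grow within the node-32 kind first */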
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int cnt = 0;
+
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+#else
+ chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
+#endif
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+ if (!chunk_exists)
+ node->count++;
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+ return chunk_exists;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..98c78eb237
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,153 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
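+ /* Find the next chunk in use, starting just after the previous position */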
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
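+ /* The update action is only used for inner nodes, so only the inner setter is needed here */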
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for testing the radix tree implementation
+in src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
'--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..f944945db9
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,674 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in alternating order like 1, children, 2, children - 1, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType*) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.1
Attachment: v26-0006-Adjust-some-inlining-declarations.patch (text/x-patch)
From 0726bb6b4e0250a72ce399d945d250724b4a29ab Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 6 Feb 2023 21:04:14 +0700
Subject: [PATCH v26 6/9] Adjust some inlining declarations
---
src/include/lib/radixtree.h | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index d6919aef08..4bd0aaa810 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1124,7 +1124,7 @@ RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_le
* Create a new node as the root. Subordinate nodes will be created during
* the insertion.
*/
-static void
+static pg_noinline void
RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
{
int shift = RT_KEY_GET_SHIFT(key);
@@ -1215,7 +1215,7 @@ RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
/*
* Replace old_child with new_child, and free the old one.
*/
-static void
+static inline void
RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
RT_PTR_ALLOC new_child, uint64 key)
@@ -1242,7 +1242,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
* The radix tree doesn't have sufficient height. Extend the radix tree so
* it can store the key.
*/
-static void
+static pg_noinline void
RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
@@ -1281,7 +1281,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* The radix tree doesn't have inner and leaf nodes for given key-value pair.
* Insert inner and leaf nodes from 'node' to bottom.
*/
-static inline void
+static pg_noinline void
RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
@@ -1486,7 +1486,7 @@ RT_GET_HANDLE(RT_RADIX_TREE *tree)
/*
* Recursively free all nodes allocated to the DSA area.
*/
-static inline void
+static void
RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
{
RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
--
2.39.1
Attachment: v26-0007-Skip-unnecessary-searches-in-RT_NODE_INSERT_INNE.patch (text/x-patch)
From 6831fe27a2c9c5765113b7903403c426f09f55f6 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 6 Feb 2023 22:04:50 +0700
Subject: [PATCH v26 7/9] Skip unnecessary searches in RT_NODE_INSERT_INNER
For inner nodes, we know the key chunk doesn't exist already,
otherwise we would have found it while descending the tree.
To reinforce this fact, declare this function to return void.
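To illustrate the invariant with a toy, self-contained sketch (not the patch's
code; the node layout and names below are made up): the inner-node insert is
only reached after the chunk search failed during the descent, so it can
assert absence instead of searching again.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FANOUT 3

typedef struct ToyInner
{
	int			count;
	uint8_t		chunks[TOY_FANOUT];
	void	   *children[TOY_FANOUT];
} ToyInner;

/* Return the child for 'chunk', or NULL if the chunk is not present. */
static void *
toy_search_inner(ToyInner *node, uint8_t chunk)
{
	for (int i = 0; i < node->count; i++)
	{
		if (node->chunks[i] == chunk)
			return node->children[i];
	}
	return NULL;
}

/*
 * Insert a child for 'chunk'.  The caller only gets here after
 * toy_search_inner() returned NULL while descending, so we assert the
 * chunk is absent instead of searching for it again.
 */
static void
toy_insert_inner(ToyInner *node, uint8_t chunk, void *child)
{
	assert(toy_search_inner(node, chunk) == NULL);
	assert(node->count < TOY_FANOUT);
	node->chunks[node->count] = chunk;
	node->children[node->count] = child;
	node->count++;
}

int
main(void)
{
	ToyInner	node = {0};
	int			dummy_child;

	/* During descent: search first; only insert when the chunk is absent. */
	if (toy_search_inner(&node, 42) == NULL)
		toy_insert_inner(&node, 42, &dummy_child);
	return 0;
}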
---
src/include/lib/radixtree.h | 4 +--
src/include/lib/radixtree_insert_impl.h | 48 ++++++++++++-------------
2 files changed, 24 insertions(+), 28 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 4bd0aaa810..1cdb995e54 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -685,7 +685,7 @@ typedef struct RT_ITER
} RT_ITER;
-static bool RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child);
static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_VALUE_TYPE *value_p);
@@ -1375,7 +1375,7 @@ RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
* If the node we're inserting into needs to grow, we update the parent's
* child pointer with the pointer to the new larger node.
*/
-static bool
+static void
RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
uint64 key, RT_PTR_ALLOC child)
{
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index c18e26b537..d56e58dcac 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -28,10 +28,10 @@
#endif
uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
- bool chunk_exists = false;
#ifdef RT_NODE_LEVEL_LEAF
const bool is_leaf = true;
+ bool chunk_exists = false;
Assert(RT_NODE_IS_LEAF(node));
#else
const bool is_leaf = false;
@@ -43,21 +43,18 @@
case RT_NODE_KIND_3:
{
RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
- int idx;
- idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
if (idx != -1)
{
/* found the existing chunk */
chunk_exists = true;
-#ifdef RT_NODE_LEVEL_LEAF
n3->values[idx] = *value_p;
-#else
- n3->children[idx] = child;
-#endif
break;
}
-
+#endif
if (unlikely(RT_NODE_MUST_GROW(n3)))
{
RT_PTR_ALLOC allocnode;
@@ -113,21 +110,18 @@
{
const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
- int idx;
- idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
if (idx != -1)
{
/* found the existing chunk */
chunk_exists = true;
-#ifdef RT_NODE_LEVEL_LEAF
n32->values[idx] = *value_p;
-#else
- n32->children[idx] = child;
-#endif
break;
}
-
+#endif
if (unlikely(RT_NODE_MUST_GROW(n32)) &&
n32->base.n.fanout < class32_max.fanout)
{
@@ -220,21 +214,19 @@
case RT_NODE_KIND_125:
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
- int slotpos = n125->base.slot_idxs[chunk];
+ int slotpos;
int cnt = 0;
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
if (slotpos != RT_INVALID_SLOT_IDX)
{
/* found the existing chunk */
chunk_exists = true;
-#ifdef RT_NODE_LEVEL_LEAF
n125->values[slotpos] = *value_p;
-#else
- n125->children[slotpos] = child;
-#endif
break;
}
-
+#endif
if (unlikely(RT_NODE_MUST_GROW(n125)))
{
RT_PTR_ALLOC allocnode;
@@ -300,14 +292,10 @@
#ifdef RT_NODE_LEVEL_LEAF
chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
-#else
- chunk_exists = RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk);
-#endif
Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
-
-#ifdef RT_NODE_LEVEL_LEAF
RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
RT_NODE_INNER_256_SET(n256, chunk, child);
#endif
break;
@@ -315,8 +303,12 @@
}
/* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
if (!chunk_exists)
node->count++;
+#else
+ node->count++;
+#endif
/*
* Done. Finally, verify the chunk and value is inserted or replaced
@@ -324,7 +316,11 @@
*/
RT_VERIFY_NODE(node);
+#ifdef RT_NODE_LEVEL_LEAF
return chunk_exists;
+#else
+ return;
+#endif
#undef RT_NODE3_TYPE
#undef RT_NODE32_TYPE
--
2.39.1
Attachment: v26-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (text/x-patch)
From f17e983832736a1daa64e67a10f9a64189b68210 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v26 8/9] Add TIDStore, to store sets of TIDs (ItemPointerData)
efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but a follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
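For illustration, here is a minimal, self-contained sketch of the encoding
idea (not the patch's code; the names and constants below are made up and
assume 8kB heap pages, where MaxHeapTuplesPerPage < 2^9):

#include <stdint.h>
#include <stdio.h>

#define OFFSET_NBITS	9	/* bits reserved for the offset number */
#define VALUE_NBITS		6	/* log2(64): bits covered by one bitmap value */

/*
 * Combine block and offset into one integer, then split it: the lowest
 * VALUE_NBITS bits select a bit in the 64-bit bitmap value, and the
 * remaining bits form the radix tree key.
 */
static uint64_t
toy_encode_tid(uint32_t blkno, uint16_t offnum, unsigned *bit)
{
	uint64_t	tid_i = ((uint64_t) blkno << OFFSET_NBITS) | offnum;

	*bit = (unsigned) (tid_i & ((UINT64_C(1) << VALUE_NBITS) - 1));
	return tid_i >> VALUE_NBITS;
}

int
main(void)
{
	unsigned	bit;
	uint64_t	key = toy_encode_tid(1000, 7, &bit);

	/* Each key covers 64 consecutive offsets of one block (8 keys per block). */
	printf("key=%llu bit=%u\n", (unsigned long long) key, bit);
	return 0;
}

One radix tree entry can thus cover up to 64 offsets of a block.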
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 688 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 195 +++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1033 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1756f1a4b6..d936aa3da3 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2192,6 +2192,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..4c72673ce9
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,688 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of a 64-bit key and a 64-bit value,
+ * and stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * by tidstore_attach().
+ *
+ * For concurrency, it mostly relies on the concurrency support in the radix
+ * tree, but we acquire the lock on a TidStore in some cases, for example,
+ * when resetting the store and when accessing the number of tids in the
+ * store (num_tids).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of a 64-bit key and
+ * a 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is derived from max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys could help keep the radix
+ * tree shallow.
+ *
+ * For example, a heap tid with 8kB blocks uses the lowest 9 bits for
+ * the offset number and the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ *
+ * If the bitmap of all possible offset numbers fits in a 64-bit value, we
+ * don't encode tids but directly use the block number as the key and the
+ * offset number as the bit position in the value.
+ */
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for max_offset */
+ bool encode_tids; /* do we use tid encoding? */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used for the key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends not only on the number of stored tids, but
+ * also on their distribution, on how the radix tree stores them, and on
+ * the memory management that backs the radix tree. The maximum number of
+ * bytes a TidStore can use is specified by max_bytes in tidstore_create(),
+ * and we want the total memory consumption of a TidStore not to exceed it.
+ *
+ * In the local TidStore case, the radix tree uses a slab allocator for each
+ * node class. The most memory-consuming case while adding tids associated
+ * with one page (i.e. during tidstore_add_tids()) is allocating a new
+ * slab block for a new radix tree node, which is approximately 70kB.
+ * Therefore, we deduct 70kB from max_bytes.
+ *
+ * In the shared case, DSA allocates memory segments whose sizes follow a
+ * geometric series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases segment
+ * size, and the simulation showed that a 75% threshold for the maximum
+ * bytes works well when max_bytes is a power of two, and a 60% threshold
+ * works for other cases.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ /*
+ * We use tid encoding if the bitmap of all possible offset numbers doesn't
+ * fit in a uint64 value.
+ */
+ if (ts->control->offset_nbits > TIDSTORE_VALUE_NBITS)
+ {
+ ts->control->encode_tids = true;
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+ }
+ else
+ {
+ ts->control->encode_tids = false;
+ ts->control->offset_key_nbits = 0;
+ }
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected Tids. It's similar to tidstore_destroy but we don't
+ * free the entire TidStore; we recreate only the radix tree storage.
+ */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ ItemPointerData tid;
+ uint64 key_base;
+ uint64 *values;
+ int nkeys;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (ts->control->encode_tids)
+ {
+ key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+ }
+ else
+ {
+ key_base = (uint64) blkno;
+ nkeys = 1;
+ }
+ values = palloc0(sizeof(uint64) * nkeys);
+
+ ItemPointerSetBlockNumber(&tid, blkno);
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 key;
+ uint32 off;
+ int idx;
+
+ ItemPointerSetOffsetNumber(&tid, offsets[i]);
+
+ /* encode the tid to key and val */
+ key = tid_to_key_off(ts, &tid, &off);
+
+ idx = key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ values[idx] |= UINT64CONST(1) << off;
+ }
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i < nkeys; i++)
+ {
+ if (values[i])
+ {
+ uint64 key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &values[i]);
+ else
+ local_rt_set(ts->tree.local, key, &values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value pair into the radix tree.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a pointer to TidStoreIterResult that has tids
+ * in one block. We return the block numbers in ascending order and the offset
+ * numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ /* Process the previously collected key-value */
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/*
+ * Finish an iteration over TidStore. This needs to be called after finishing
+ * or when exiting an iteration.
+ */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if (i > iter->ts->control->max_offset)
+ {
+ Assert(!iter->ts->control->encode_tids);
+ break;
+ }
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ if (ts->control->encode_tids)
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+
+ return (BlockNumber) key;
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+{
+ uint64 key;
+ uint64 tid_i;
+
+ if (!ts->control->encode_tids)
+ {
+ *off = ItemPointerGetOffsetNumber(tid);
+
+ /* Use the block number as the key */
+ return (uint64) ItemPointerGetBlockNumber(tid);
+ }
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << ts->control->offset_nbits;
+
+ *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
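As a quick orientation, the following is a minimal backend-local usage
sketch based only on the declarations above (block and offset values are
made up, and error handling is omitted); passing NULL for the dsa_area
keeps the store in local memory, as the test module below and the serial
vacuum code in the later patch do. The test module exercises the same
API more thoroughly.

/* Sketch of backend-local TidStore usage based on the header above. */
#include "postgres.h"

#include "access/htup_details.h"	/* MaxHeapTuplesPerPage */
#include "access/tidstore.h"

static void
collect_and_scan(void)
{
	TidStore   *ts;
	TidStoreIter *iter;
	TidStoreIterResult *res;
	OffsetNumber offs[] = {1, 2, 5};
	ItemPointerData tid;

	/* NULL dsa_area: the store lives in backend-local memory */
	ts = tidstore_create(1024 * 1024, MaxHeapTuplesPerPage, NULL);

	tidstore_add_tids(ts, 42, offs, lengthof(offs));

	ItemPointerSet(&tid, 42, 5);
	if (tidstore_lookup_tid(ts, &tid))
		elog(DEBUG1, "TID (42,5) is stored");

	/* iterate in block order */
	iter = tidstore_begin_iterate(ts);
	while ((res = tidstore_iterate_next(iter)) != NULL)
		elog(DEBUG1, "block %u has %d offsets", res->blkno, res->num_offsets);
	tidstore_end_iterate(iter);

	tidstore_destroy(ts);
}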
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9b849ae8e8
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,195 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.1
v26-0009-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
From ed8115b5f5c1b0745e35a0d6d72064ad9df4cf42 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 7 Feb 2023 17:19:29 +0700
Subject: [PATCH v26 9/9] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space efficient and was slow to look up. Also, we
had a 1GB limit on its size.
Now we use TidStore to store dead tuple TIDs. Since the TidStore,
backed by the radix tree, incrementally allocates memory, we get rid
of the 1GB limit.
Since we are no longer able to exactly estimate the maximum number of
TIDs that can be stored, pg_stat_progress_vacuum now shows the progress
information based on the amount of memory in bytes. The column names
are also changed to max_dead_tuple_bytes and num_dead_tuple_bytes.
In addition, since the TidStore uses the radix tree internally, the
minimum amount of memory required by the TidStore is 1MB, the initial
DSA segment size. Due to that, we increase the minimum value of
maintenance_work_mem (and autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 278 ++++++++-------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-----
src/backend/commands/vacuumparallel.c | 73 +++---
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 177 insertions(+), 314 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index d936aa3da3..0230c74e3d 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -6870,10 +6870,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -6881,10 +6881,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..b4e40423a8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3,18 +3,18 @@
* vacuumlazy.c
* Concurrent ("lazy") vacuuming.
*
- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TidStore, a storage for dead TIDs
* that are to be removed from indexes. We want to ensure we can vacuum even
* the very largest relations with finite memory space usage. To do that, we
- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables). If the array threatens to overflow, we must call
- * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
- * This frees up the memory space dedicated to storing dead TIDs.
+ * create a TidStore with the maximum bytes that can be used by the TidStore.
+ * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
+ * vacuum the pages that we've pruned). This frees up the memory space dedicated
+ * to storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,11 +221,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -259,8 +263,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -487,11 +492,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/*
- * Allocate dead_items array memory using dead_items_alloc. This handles
- * parallel VACUUM initialization as part of allocating shared memory
- * space used for dead_items. (But do a failsafe precheck first, to
- * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
- * is already dangerously old.)
+ * Allocate dead_items memory using dead_items_alloc. This handles parallel
+ * VACUUM initialization as part of allocating shared memory space used for
+ * dead_items. (But do a failsafe precheck first, to ensure that parallel
+ * VACUUM won't be attempted at all when relfrozenxid is already dangerously
+ * old.)
*/
lazy_check_wraparound_failsafe(vacrel);
dead_items_alloc(vacrel, params->nworkers);
@@ -797,7 +802,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* have collected the TIDs whose index tuples need to be removed.
*
* Finally, invokes lazy_vacuum_heap_rel to vacuum heap pages, which
- * largely consists of marking LP_DEAD items (from collected TID array)
+ * largely consists of marking LP_DEAD items (from vacrel->dead_items)
* as LP_UNUSED. This has to happen in a second, final pass over the
* heap, to preserve a basic invariant that all index AMs rely on: no
* extant index tuple can ever be allowed to contain a TID that points to
@@ -825,21 +830,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +911,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -969,7 +973,7 @@ lazy_scan_heap(LVRelState *vacrel)
continue;
}
- /* Collect LP_DEAD items in dead_items array, count tuples */
+ /* Collect LP_DEAD items in dead_items, count tuples */
if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
&recordfreespace))
{
@@ -1011,14 +1015,14 @@ lazy_scan_heap(LVRelState *vacrel)
* Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
- * dead_items array. This includes LP_DEAD line pointers that we
- * pruned ourselves, as well as existing LP_DEAD line pointers that
- * were pruned some time earlier. Also considers freezing XIDs in the
- * tuple headers of remaining items with storage.
+ * dead_items. This includes LP_DEAD line pointers that we pruned
+ * ourselves, as well as existing LP_DEAD line pointers that were pruned
+ * some time earlier. Also considers freezing XIDs in the tuple headers
+ * of remaining items with storage.
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1038,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1080,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page in dead_items */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1156,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1204,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1260,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1524,9 +1535,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
* The approach we take now is to restart pruning when the race condition is
* detected. This allows heap_page_prune() to prune the tuples inserted by
* the now-aborted transaction. This is a little crude, but it guarantees
- * that any items that make it into the dead_items array are simple LP_DEAD
- * line pointers, and that every remaining item with tuple storage is
- * considered as a candidate for freezing.
+ * that any items that make it into the dead_items are simple LP_DEAD line
+ * pointers, and that every remaining item with tuple storage is considered
+ * as a candidate for freezing.
*/
static void
lazy_scan_prune(LVRelState *vacrel,
@@ -1543,13 +1554,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1580,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1588,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->deadoffsets; prunestate->deadoffsets's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1601,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1646,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1875,7 +1883,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1896,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1917,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -1940,7 +1929,7 @@ retry:
* lazy_scan_prune, which requires a full cleanup lock. While pruning isn't
* performed here, it's quite possible that an earlier opportunistic pruning
* operation left LP_DEAD items behind. We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items for removal from indexes.
*
* For aggressive VACUUM callers, we may return false to indicate that a full
* cleanup lock is required for processing by lazy_scan_prune. This is only
@@ -2099,7 +2088,7 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
vacrel->NewRelminMxid = NoFreezePageRelminMxid;
- /* Save any LP_DEAD items found on the page in dead_items array */
+ /* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
@@ -2129,8 +2118,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2127,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2179,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2208,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2235,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2281,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2354,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2392,9 +2373,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
/*
* lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
*
- * This routine marks LP_DEAD items in vacrel->dead_items array as LP_UNUSED.
- * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
- * at all.
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
*
* We may also be able to truncate the line pointer array of the heap pages we
* visit. If there is a contiguous group of LP_UNUSED items at the end of the
@@ -2410,10 +2390,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2409,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2419,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2433,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2444,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,36 +2454,31 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2497,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2571,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -2687,8 +2660,8 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
* lazy_vacuum_one_index() -- vacuum index relation.
*
* Delete all the index tuples containing a TID collected in
- * vacrel->dead_items array. Also update running statistics.
- * Exact details depend on index AM's ambulkdelete routine.
+ * vacrel->dead_items. Also update running statistics. Exact
+ * details depend on index AM's ambulkdelete routine.
*
* reltuples is the number of heap tuples to be passed to the
* bulkdelete callback. It's always assumed to be estimated.
@@ -3094,48 +3067,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
}
/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
-/*
- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
@@ -3143,11 +3076,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3105,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3118,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8608e3fa5b..a526e607fe 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1165,7 +1165,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index aa79d9de4d..d8e680ca20 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,82 +2342,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch(itemptr,
- dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..d653683693 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,12 +9,11 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSM segment. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * the shared TidStore. We launch parallel worker processes at the start of
+ * parallel index bulk-deletion and index cleanup and once all indexes are
+ * processed, the parallel worker processes exit. Each time we process indexes
+ * in parallel, the parallel context is re-initialized so that the same DSM can
+ * be used for multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -103,6 +102,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +168,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +225,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +289,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +356,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +375,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +384,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +441,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +452,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +950,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +996,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1045,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index ff6149a179..a371f6fbba 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3397,12 +3397,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b46e3b8c55..27a88b9369 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2312,7 +2312,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..a3ebb169ef 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e7a2f5856a..f6ae02eb14 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2020,8 +2020,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.39.1
Hi,
On Tue, Feb 7, 2023 at 6:25 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Tue, Jan 31, 2023 at 9:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached v24 patches. The locking support patch is separated
(0005 patch). Also I kept the updates for TidStore and the vacuum
integration from v23 separate.

Okay, that's a lot more simple, and closer to what I imagined. For v25, I squashed v24's additions and added a couple of my own. I've kept the CF status at "needs review" because no specific action is required at the moment.
I did start to review the TID store some more, but that's on hold because something else came up: On a lark I decided to re-run some benchmarks to see if anything got lost in converting to a template, and that led me down a rabbit hole -- some good and bad news on that below.
0001:
I removed the uint64 case, as discussed. There is now a brief commit message, but it needs to be fleshed out a bit. I took another look at the Arm optimization that Nathan found some months ago, for forming the highbit mask, but that doesn't play nicely with how node32 uses it, so I decided against it. I added a comment to describe the reasoning in case someone else gets a similar idea.
I briefly looked into "separate-commit TODO: move non-SIMD fallbacks to their own header to clean up the #ifdef maze.", but decided it wasn't such a clear win to justify starting the work now. It's still in the back of my mind, but I removed the reminder from the commit message.
The changes make sense to me.
0003:
The template now requires the value to be passed as a pointer. That was a pretty trivial change, but affected multiple other patches, so not sent separately. Also adds a forgotten RT_ prefix to the bitmap macros and adds a top comment to the *_impl.h headers. There are some comment fixes. The changes were either trivial or discussed earlier, so also not sent separately.
Great.
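For readers following the thread, a minimal sketch of what the pointer-value API looks like for a caller, using the instantiation macros that appear later in the TidStore patch; the snippet itself is illustrative, not code from any patch:

#define RT_PREFIX local_rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE uint64
#include "lib/radixtree.h"

/* values are now handed to the tree by address, not by value */
uint64		val = UINT64CONST(1) << off;

local_rt_set(tree, key, &val);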
0004/5: I wanted to measure the load time as well as search time in bench_search_random_nodes(). That's kept separate to make it easier to test other patch versions.
The bad news is that the speed of loading TIDs in bench_seq/shuffle_search() has regressed noticeably. I can't reproduce this in any other bench function, which was the reason for writing 0005 to begin with. More confusingly, my efforts to fix this improved *other* functions, but the former didn't budge at all. First the patches:
0006 adds and removes some "inline" declarations (where it made sense), and adds some "pg_noinline" declarations based on Andres' advice some months ago.
Agreed.
0007 removes some dead code. RT_NODE_INSERT_INNER is only called during RT_SET_EXTEND, so it can't possibly find an existing key. This kind of change is much easier with the inner/node cases handled together in a template, as far as being sure of how those cases are different. I thought about trying the search in assert builds and verifying it doesn't exist, but thought yet another #ifdef would be too messy.
Agreed.
v25-addendum-try-no-maintain-order.txt -- It makes keeping the key chunks in order optional for the linear-search nodes. I believe the TID store no longer cares about the ordering, but this is a text file for now because I don't want to clutter the CI with a behavior change. Also, the second ART paper (on concurrency) mentioned that some locking schemes don't allow these arrays to be shifted. So it might make sense to give up entirely on guaranteeing ordered iteration, or at least make it optional as in the patch.
I think it's still important for lazy vacuum that an iteration over a
TID store returns TIDs in ascending order, because otherwise a heap
vacuum does random writes. That being said, we can have
RT_ITERATE_NEXT() return key-value pairs in an order regardless of how
the key chunks are stored in a node.
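As a rough sketch of why that ordering matters for the second heap pass (the iterator function names are from the TidStore patch quoted later in this thread; the result fields and the page-level helper are assumptions for illustration only):

TidStoreIter *iter = tidstore_begin_iterate(dead_items);
TidStoreIterResult *result;		/* assumed to expose blkno and offsets */

/*
 * The radix tree iterates in key order, and the key's high bits are the
 * block number, so blocks come back in ascending order and the pass below
 * writes heap pages sequentially rather than randomly.
 */
while ((result = tidstore_iterate_next(iter)) != NULL)
	vacuum_one_heap_page(result->blkno, result->offsets);	/* hypothetical helper */

tidstore_end_iterate(iter);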
========================================
psql -c "select rt_load_ms, rt_search_ms from bench_seq_search(0, 1 * 1000 * 1000)"
(min load time of three)

v15:
rt_load_ms | rt_search_ms
------------+--------------
113 | 455

v25-0005:
rt_load_ms | rt_search_ms
------------+--------------
135 | 456

v25-0006 (inlining or not):
rt_load_ms | rt_search_ms
------------+--------------
136 | 455

v25-0007 (remove dead code):
rt_load_ms | rt_search_ms
------------+--------------
135 | 455

v25-addendum...txt (no ordering):
rt_load_ms | rt_search_ms
------------+--------------
134 | 455

Note: The regression seems to have started in v17, which is the first with a full template.
Nothing so far has helped here, and previous experience has shown that trying to profile 100ms will not be useful. Instead of putting more effort into diving deeper, it seems a better use of time to write a benchmark that calls the tid store itself. That's more realistic, since this function was intended to test load and search of tids, but the tid store doesn't quite operate so simply anymore. What do you think, Masahiko?
Yeah, that's more realistic. TidStore now encodes TIDs slightly
differently from the benchmark test.
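For reference, a minimal standalone sketch of that encoding, following the logic of tid_to_key_off() in the TidStore patch: the low 6 bits of (block << offset_nbits | offset) select a bit in a 64-bit value, and the remaining high bits form the radix tree key. The constants below assume 8kB pages; the program itself is only an illustration.

#include <stdint.h>
#include <stdio.h>

#define VALUE_NBITS 6	/* log2(64): offset bits kept in the bitmap value */

static uint64_t
encode_key_off(uint32_t block, uint32_t offset, int offset_nbits,
			   uint64_t *off_bit)
{
	uint64_t	tid_i = offset | ((uint64_t) block << offset_nbits);

	/* low 6 bits pick a bit within the 64-bit bitmap value */
	*off_bit = UINT64_C(1) << (tid_i & ((UINT64_C(1) << VALUE_NBITS) - 1));
	/* everything above those 6 bits becomes the radix tree key */
	return tid_i >> VALUE_NBITS;
}

int
main(void)
{
	uint64_t	off_bit;
	/* 9 offset bits cover MaxHeapTuplesPerPage (291) with 8kB pages */
	uint64_t	key = encode_key_off(42, 7, 9, &off_bit);

	printf("key=%llu off_bit=0x%llx\n",
		   (unsigned long long) key, (unsigned long long) off_bit);
	return 0;
}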
I've attached the patch that adds a simple benchmark test using
TidStore. With this test, I got similar trends of results to yours
with gcc, but I've not analyzed them in depth yet.
query: select * from bench_tidstore_load(0, 10 * 1000 * 1000)
v15:
load_ms
---------
816
v25-0007 (remove dead code):
load_ms
---------
839
v25-addendum...txt (no ordering):
load_ms
---------
820
BTW it would be better to remove the RT_DEBUG macro from bench_radix_tree.c.
I'm inclined to keep 0006, because it might give a slight boost, and 0007 because it's never a bad idea to remove dead code.
Yeah, these two changes make sense to me too.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
0001-Add-bench_tidstore_load.patch.txt
From e056133360436e115a434a8a21685a99602a5b5d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Feb 2023 15:53:14 +0900
Subject: [PATCH] Add bench_tidstore_load()
---
.../bench_radix_tree--1.0.sql | 10 ++++
contrib/bench_radix_tree/bench_radix_tree.c | 46 +++++++++++++++++++
2 files changed, 56 insertions(+)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 95eedbbe10..fbf51c1086 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -75,3 +75,13 @@ OUT rt_sparseload_ms int8
returns record
as 'MODULE_PATHNAME'
LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_tidstore_load(
+minblk int4,
+maxblk int4,
+OUT mem_allocated int8,
+OUT load_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 7d1e2eee57..3c2caa3b90 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -9,6 +9,7 @@
*/
#include "postgres.h"
+#include "access/tidstore.h"
#include "common/pg_prng.h"
#include "fmgr.h"
#include "funcapi.h"
@@ -54,6 +55,7 @@ PG_FUNCTION_INFO_V1(bench_load_random_int);
PG_FUNCTION_INFO_V1(bench_fixed_height_search);
PG_FUNCTION_INFO_V1(bench_search_random_nodes);
PG_FUNCTION_INFO_V1(bench_node128_load);
+PG_FUNCTION_INFO_V1(bench_tidstore_load);
static uint64
tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -168,6 +170,50 @@ vac_cmp_itemptr(const void *left, const void *right)
}
#endif
+Datum
+bench_tidstore_load(PG_FUNCTION_ARGS)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ TidStore *ts;
+ OffsetNumber *offs;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_ms;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2] = {false};
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ offs = palloc(sizeof(OffsetNumber) * TIDS_PER_BLOCK_FOR_LOAD);
+ for (int i = 0; i < TIDS_PER_BLOCK_FOR_LOAD; i++)
+ offs[i] = i + 1; /* FirstOffsetNumber is 1 */
+
+ ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* load tids */
+ start_time = GetCurrentTimestamp();
+ for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
+ tidstore_add_tids(ts, blkno, offs, TIDS_PER_BLOCK_FOR_LOAD);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_ms = secs * 1000 + usecs / 1000;
+
+ values[0] = Int64GetDatum(tidstore_memory_usage(ts));
+ values[1] = Int64GetDatum(load_ms);
+
+ tidstore_destroy(ts);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
static Datum
bench_search(FunctionCallInfo fcinfo, bool shuffle)
{
--
2.31.1
On Thu, Feb 9, 2023 at 2:08 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
I think it's still important for lazy vacuum that an iteration over a
TID store returns TIDs in ascending order, because otherwise a heap
vacuum does random writes. That being said, we can have
RT_ITERATE_NEXT() return key-value pairs in an order regardless of how
the key chunks are stored in a node.
Okay, we can keep that possibility in mind if we need to go there.
Note: The regression seems to have started in v17, which is the first
with a full template.
0007 removes some dead code. RT_NODE_INSERT_INNER is only called during
RT_SET_EXTEND, so it can't possibly find an existing key. This kind of
change is much easier with the inner/node cases handled together in a
template, as far as being sure of how those cases are different. I thought
about trying the search in assert builds and verifying it doesn't exist,
but thought yet another #ifdef would be too messy.
It just occurred to me that these facts might be related. v17 was the first
use of the full template, and I decided then I liked one of your earlier
patches where replace_node() calls node_update_inner() better than calling
node_insert_inner() with a NULL parent, which was a bit hard to understand.
That now-dead code was actually used in the latter case for updating the
(original) parent. It's possible that trying to use separate paths
contributed to the regression. I'll try the other way and report back.
I've attached the patch that adds a simple benchmark test using
TidStore. With this test, I got similar trends of results to yours
with gcc, but I've not analyzed them in depth yet.
Thanks for that! I'll take a look.
BTW it would be better to remove the RT_DEBUG macro from
bench_radix_tree.c.
Absolutely.
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Feb 9, 2023 at 2:08 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
query: select * from bench_tidstore_load(0, 10 * 1000 * 1000)
v15:
load_ms
---------
816
How did you build the tid store and test on v15? I first tried to
apply v15-0009-PoC-lazy-vacuum-integration.patch, which conflicts with
vacuum now, so reset all that, but still getting build errors because the
tid store types and functions have changed.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Feb 10, 2023 at 3:51 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Feb 9, 2023 at 2:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
query: select * from bench_tidstore_load(0, 10 * 1000 * 1000)
v15:
load_ms
---------
816

How did you build the tid store and test on v15? I first tried to apply v15-0009-PoC-lazy-vacuum-integration.patch, which conflicts with vacuum now, so reset all that, but still getting build errors because the tid store types and functions have changed.
I applied v26-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
on top of v15 radix tree and changed the TidStore so that it uses v15
(non-templated) radixtree. That way, we can test TidStore using v15
radix tree. I've attached the patch that I applied on top of
v26-0008-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
change_tidstore_for_v15.patch
commit f2d6acbce26d7e05e64666ae00fca030a657de76
Author: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed Feb 8 15:52:47 2023 +0900
Add TidStore from v26 patch.
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 4c72673ce9..5048400a9f 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -29,6 +29,7 @@
#include "access/tidstore.h"
#include "miscadmin.h"
#include "port/pg_bitutils.h"
+#include "lib/radixtree.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"
#include "utils/memutils.h"
@@ -74,21 +75,6 @@
/* A magic value used to identify our TidStores. */
#define TIDSTORE_MAGIC 0x826f6a10
-#define RT_PREFIX local_rt
-#define RT_SCOPE static
-#define RT_DECLARE
-#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
-#include "lib/radixtree.h"
-
-#define RT_PREFIX shared_rt
-#define RT_SHMEM
-#define RT_SCOPE static
-#define RT_DECLARE
-#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
-#include "lib/radixtree.h"
-
/* The control object for a TidStore */
typedef struct TidStoreControl
{
@@ -110,7 +96,6 @@ typedef struct TidStoreControl
/* handles for TidStore and radix tree */
tidstore_handle handle;
- shared_rt_handle tree_handle;
} TidStoreControl;
/* Per-backend state for a TidStore */
@@ -125,14 +110,9 @@ struct TidStore
/* Storage for Tids. Use either one depending on TidStoreIsShared() */
union
{
- local_rt_radix_tree *local;
- shared_rt_radix_tree *shared;
+ radix_tree *local;
} tree;
-
- /* DSA area for TidStore if used */
- dsa_area *area;
};
-#define TidStoreIsShared(ts) ((ts)->area != NULL)
/* Iterator for TidStore */
typedef struct TidStoreIter
@@ -142,8 +122,8 @@ typedef struct TidStoreIter
/* iterator of radix tree. Use either one depending on TidStoreIsShared() */
union
{
- shared_rt_iter *shared;
- local_rt_iter *local;
+ rt_iter *shared;
+ rt_iter *local;
} tree_iter;
/* we returned all tids? */
@@ -194,31 +174,10 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
* perfectly works in case where the max_bytes is a power-of-2, and the 60%
* threshold works for other cases.
*/
- if (area != NULL)
- {
- dsa_pointer dp;
- float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
-
- ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
- LWTRANCHE_SHARED_TIDSTORE);
-
- dp = dsa_allocate0(area, sizeof(TidStoreControl));
- ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes = (uint64) (max_bytes * ratio);
- ts->area = area;
+ ts->tree.local = rt_create(CurrentMemoryContext);
- ts->control->magic = TIDSTORE_MAGIC;
- LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
- ts->control->handle = dp;
- ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
- }
- else
- {
- ts->tree.local = local_rt_create(CurrentMemoryContext);
-
- ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
- ts->control->max_bytes = max_bytes - (70 * 1024);
- }
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
ts->control->max_offset = max_offset;
ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
@@ -242,50 +201,6 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
return ts;
}
-/*
- * Attach to the shared TidStore using a handle. The returned object is
- * allocated in backend-local memory using the CurrentMemoryContext.
- */
-TidStore *
-tidstore_attach(dsa_area *area, tidstore_handle handle)
-{
- TidStore *ts;
- dsa_pointer control;
-
- Assert(area != NULL);
- Assert(DsaPointerIsValid(handle));
-
- /* create per-backend state */
- ts = palloc0(sizeof(TidStore));
-
- /* Find the control object in shared memory */
- control = handle;
-
- /* Set up the TidStore */
- ts->control = (TidStoreControl *) dsa_get_address(area, control);
- Assert(ts->control->magic == TIDSTORE_MAGIC);
-
- ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
- ts->area = area;
-
- return ts;
-}
-
-/*
- * Detach from a TidStore. This detaches from radix tree and frees the
- * backend-local resources. The radix tree will continue to exist until
- * it is either explicitly destroyed, or the area that backs it is returned
- * to the operating system.
- */
-void
-tidstore_detach(TidStore *ts)
-{
- Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
-
- shared_rt_detach(ts->tree.shared);
- pfree(ts);
-}
-
/*
* Destroy a TidStore, returning all memory.
*
@@ -298,25 +213,8 @@ tidstore_detach(TidStore *ts)
void
tidstore_destroy(TidStore *ts)
{
- if (TidStoreIsShared(ts))
- {
- Assert(ts->control->magic == TIDSTORE_MAGIC);
-
- /*
- * Vandalize the control block to help catch programming error where
- * other backends access the memory formerly occupied by this radix
- * tree.
- */
- ts->control->magic = 0;
- dsa_free(ts->area, ts->control->handle);
- shared_rt_free(ts->tree.shared);
- }
- else
- {
- pfree(ts->control);
- local_rt_free(ts->tree.local);
- }
-
+ pfree(ts->control);
+ rt_free(ts->tree.local);
pfree(ts);
}
@@ -327,39 +225,11 @@ tidstore_destroy(TidStore *ts)
void
tidstore_reset(TidStore *ts)
{
- if (TidStoreIsShared(ts))
- {
- Assert(ts->control->magic == TIDSTORE_MAGIC);
-
- LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
-
- /*
- * Free the radix tree and return allocated DSA segments to
- * the operating system.
- */
- shared_rt_free(ts->tree.shared);
- dsa_trim(ts->area);
+ rt_free(ts->tree.local);
+ ts->tree.local = rt_create(CurrentMemoryContext);
- /* Recreate the radix tree */
- ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
- LWTRANCHE_SHARED_TIDSTORE);
-
- /* update the radix tree handle as we recreated it */
- ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
-
- /* Reset the statistics */
- ts->control->num_tids = 0;
-
- LWLockRelease(&ts->control->lock);
- }
- else
- {
- local_rt_free(ts->tree.local);
- ts->tree.local = local_rt_create(CurrentMemoryContext);
-
- /* Reset the statistics */
- ts->control->num_tids = 0;
- }
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
}
/* Add Tids on a block to TidStore */
@@ -372,8 +242,6 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
uint64 *values;
int nkeys;
- Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
-
if (ts->control->encode_tids)
{
key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
@@ -404,9 +272,6 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
values[idx] |= UINT64CONST(1) << off;
}
- if (TidStoreIsShared(ts))
- LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
-
/* insert the calculated key-values to the tree */
for (int i = 0; i < nkeys; i++)
{
@@ -414,19 +279,13 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
{
uint64 key = key_base + i;
- if (TidStoreIsShared(ts))
- shared_rt_set(ts->tree.shared, key, &values[i]);
- else
- local_rt_set(ts->tree.local, key, &values[i]);
+ rt_set(ts->tree.local, key, values[i]);
}
}
/* update statistics */
ts->control->num_tids += num_offsets;
- if (TidStoreIsShared(ts))
- LWLockRelease(&ts->control->lock);
-
pfree(values);
}
@@ -441,10 +300,7 @@ tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
key = tid_to_key_off(ts, tid, &off);
- if (TidStoreIsShared(ts))
- found = shared_rt_search(ts->tree.shared, key, &val);
- else
- found = local_rt_search(ts->tree.local, key, &val);
+ found = rt_search(ts->tree.local, key, &val);
if (!found)
return false;
@@ -464,18 +320,13 @@ tidstore_begin_iterate(TidStore *ts)
{
TidStoreIter *iter;
- Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
-
iter = palloc0(sizeof(TidStoreIter));
iter->ts = ts;
iter->result.blkno = InvalidBlockNumber;
iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
- if (TidStoreIsShared(ts))
- iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
- else
- iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+ iter->tree_iter.local = rt_begin_iterate(ts->tree.local);
/* If the TidStore is empty, there is no business */
if (tidstore_num_tids(ts) == 0)
@@ -487,10 +338,7 @@ tidstore_begin_iterate(TidStore *ts)
static inline bool
tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
{
- if (TidStoreIsShared(iter->ts))
- return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
-
- return local_rt_iterate_next(iter->tree_iter.local, key, val);
+ return rt_iterate_next(iter->tree_iter.local, key, val);
}
/*
@@ -547,10 +395,7 @@ tidstore_iterate_next(TidStoreIter *iter)
void
tidstore_end_iterate(TidStoreIter *iter)
{
- if (TidStoreIsShared(iter->ts))
- shared_rt_end_iterate(iter->tree_iter.shared);
- else
- local_rt_end_iterate(iter->tree_iter.local);
+ rt_end_iterate(iter->tree_iter.local);
pfree(iter->result.offsets);
pfree(iter);
@@ -560,26 +405,13 @@ tidstore_end_iterate(TidStoreIter *iter)
int64
tidstore_num_tids(TidStore *ts)
{
- uint64 num_tids;
-
- Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
-
- if (!TidStoreIsShared(ts))
- return ts->control->num_tids;
-
- LWLockAcquire(&ts->control->lock, LW_SHARED);
- num_tids = ts->control->num_tids;
- LWLockRelease(&ts->control->lock);
-
- return num_tids;
+ return ts->control->num_tids;
}
/* Return true if the current memory usage of TidStore exceeds the limit */
bool
tidstore_is_full(TidStore *ts)
{
- Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
-
return (tidstore_memory_usage(ts) > ts->control->max_bytes);
}
@@ -587,8 +419,6 @@ tidstore_is_full(TidStore *ts)
size_t
tidstore_max_memory(TidStore *ts)
{
- Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
-
return ts->control->max_bytes;
}
@@ -596,17 +426,7 @@ tidstore_max_memory(TidStore *ts)
size_t
tidstore_memory_usage(TidStore *ts)
{
- Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
-
- /*
- * In the shared case, TidStoreControl and radix_tree are backed by the
- * same DSA area and rt_memory_usage() returns the value including both.
- * So we don't need to add the size of TidStoreControl separately.
- */
- if (TidStoreIsShared(ts))
- return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
-
- return sizeof(TidStore) + sizeof(TidStore) + local_rt_memory_usage(ts->tree.local);
+ return sizeof(TidStore) + sizeof(TidStore) + rt_memory_usage(ts->tree.local);
}
/*
@@ -615,7 +435,6 @@ tidstore_memory_usage(TidStore *ts)
tidstore_handle
tidstore_get_handle(TidStore *ts)
{
- Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
return ts->control->handle;
}
I didn't get any closer to radix-tree regression, but I did find some
inefficiencies in tidstore_add_tids() that are worth talking about first,
addressed in a rough fashion in the attached .txt addendums that I can
clean up and incorporate later.
To start, I can reproduce the regression with this test as well:
select * from bench_tidstore_load(0, 10 * 1000 * 1000);
v15 + v26 store + adjustments:
mem_allocated | load_ms
---------------+---------
98202152 | 1676
v26 0001-0008
mem_allocated | load_ms
---------------+---------
98202032 | 1826
...and reverting to the alternate way to update the parent didn't help:
v26 0001-6, 0008, insert_inner w/ null parent
mem_allocated | load_ms
---------------+---------
98202032 | 1825
...and I'm kind of glad that wasn't the problem, because going back to that
would be a pain for the shmem case.
Running perf doesn't show anything much different in the proportions (note
that rt_set must have been inlined when declared locally in v26):
v15 + v26 store + adjustments:
65.88% postgres postgres [.] tidstore_add_tids
10.74% postgres postgres [.] rt_set
9.20% postgres postgres [.] palloc0
6.49% postgres postgres [.] rt_node_insert_leaf
v26 0001-0008
78.50% postgres postgres [.] tidstore_add_tids
8.88% postgres postgres [.] palloc0
6.24% postgres postgres [.] local_rt_node_insert_leaf
v2699-0001: The first thing I noticed is that palloc0 is taking way more
time than it should, and it's because the compiler doesn't know the
values[] array is small. One reason we need to zero the array is to make
the algorithm agnostic about what order the offsets come in, as I requested
in a previous review. Thinking some more, I was way too paranoid about
that. As long as access methods scan the line pointer array in the usual
way, maybe we can just assert that the keys we create are in order, and
zero any unused array entries as we find them. (I admit I can't actually
think of a reason we would ever encounter offsets out of order.) Also, we
can keep track of the last key we need to consider for insertion into the
radix tree, and ignore the rest. That might shave a few cycles during the
exclusive lock when the max offset of an LP_DEAD item < 64 on a given page,
which I think would be common in the wild. I also got rid of the special
case for non-encoding, since shifting by zero should work the same way.
These together led to a nice speedup on the v26 branch:
mem_allocated | load_ms
---------------+---------
98202032 | 1386
v2699-0002: The next thing I noticed is forming a full ItemPointer to
pass to tid_to_key_off(). That's bad for tidstore_add_tids() because
ItemPointerSetBlockNumber() must do this in order to allow the struct to be
SHORTALIGN'd:
static inline void
BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
{
blockId->bi_hi = blockNumber >> 16;
blockId->bi_lo = blockNumber & 0xffff;
}
Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must
reverse the above process:
static inline BlockNumber
BlockIdGetBlockNumber(const BlockIdData *blockId)
{
return (((BlockNumber) blockId->bi_hi) << 16) | ((BlockNumber)
blockId->bi_lo);
}
There is no reason to do any of this if we're not reading/writing directly
to/from an on-disk tid etc. To avoid this, I created a new function
encode_key_off() [name could be better], which deals with the raw block
number that we already have. Then turn tid_to_key_off() into a wrapper
around that, since we still need the full conversion for
tidstore_lookup_tid().
v2699-0003: Get rid of all the remaining special cases for encoding or not.
I am unaware of the need to optimize that case or treat it in any way
differently. I haven't tested this on an installation with non-default
blocksize and didn't measure this separately, but 0002+0003 gives:
mem_allocated | load_ms
---------------+---------
98202032 | 1259
If these are acceptable, I can incorporate them into a later patchset. In
any case, speeding up tidstore_add_tids() will make any regressions in the
backing radix tree more obvious. I will take a look at that next week.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v2699-0002-Do-less-work-when-encoding-key-value.patch.txt
From 6bdd33fa4f55757b54d16ce00dc60a21b929606e Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 11 Feb 2023 10:45:21 +0700
Subject: [PATCH v2699 2/3] Do less work when encoding key/value
---
src/backend/access/common/tidstore.c | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 5d24680737..3d384cf645 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -159,6 +159,7 @@ typedef struct TidStoreIter
static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off);
static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
/*
@@ -367,7 +368,6 @@ void
tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
int num_offsets)
{
- ItemPointerData tid;
uint64 *values;
uint64 key;
uint64 prev_key;
@@ -381,16 +381,12 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
values = palloc(sizeof(uint64) * nkeys);
key = prev_key = key_base;
- ItemPointerSetBlockNumber(&tid, blkno);
-
for (int i = 0; i < num_offsets; i++)
{
uint32 off;
- ItemPointerSetOffsetNumber(&tid, offsets[i]);
-
/* encode the tid to key and val */
- key = tid_to_key_off(ts, &tid, &off);
+ key = encode_key_off(ts, blkno, offsets[i], &off);
/* make sure we scanned the line pointer array in order */
Assert(key >= prev_key);
@@ -681,20 +677,29 @@ key_get_blkno(TidStore *ts, uint64 key)
/* Encode a tid to key and offset */
static inline uint64
tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+{
+ uint32 offset = ItemPointerGetOffsetNumber(tid);
+ BlockNumber block = ItemPointerGetBlockNumber(tid);
+
+ return encode_key_off(ts, block, offset, off);
+}
+
+/* encode a block and offset to a key and partial offset */
+static inline uint64
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off)
{
uint64 key;
uint64 tid_i;
if (!ts->control->encode_tids)
{
- *off = ItemPointerGetOffsetNumber(tid);
+ *off = offset;
/* Use the block number as the key */
- return (int64) ItemPointerGetBlockNumber(tid);
+ return (int64) block;
}
- tid_i = ItemPointerGetOffsetNumber(tid);
- tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << ts->control->offset_nbits;
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
*off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
key = tid_i >> TIDSTORE_VALUE_NBITS;
--
2.39.1
v2699-0001-Miscellaneous-optimizations-for-tidstore_add_t.patch.txt
From c0bc497f50318c8e31ccdf0c2a9186ffc736abeb Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Fri, 10 Feb 2023 19:56:01 +0700
Subject: [PATCH v2699 1/3] Miscellaneous optimizations for tidstore_add_tids()
- remove palloc0; it's expensive for lengths not known at compile-time
- optimize for case with only one key per heap block
- make some intializations const and branch-free
- when writing to the radix tree, stop at the last non-zero bitmap
---
src/backend/access/common/tidstore.c | 56 ++++++++++++++++++----------
1 file changed, 36 insertions(+), 20 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 4c72673ce9..5d24680737 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -368,51 +368,67 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
int num_offsets)
{
ItemPointerData tid;
- uint64 key_base;
uint64 *values;
- int nkeys;
+ uint64 key;
+ uint64 prev_key;
+ uint64 off_bitmap = 0;
+ int idx;
+ const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- if (ts->control->encode_tids)
- {
- key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
- nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
- }
- else
- {
- key_base = (uint64) blkno;
- nkeys = 1;
- }
- values = palloc0(sizeof(uint64) * nkeys);
+ values = palloc(sizeof(uint64) * nkeys);
+ key = prev_key = key_base;
ItemPointerSetBlockNumber(&tid, blkno);
+
for (int i = 0; i < num_offsets; i++)
{
- uint64 key;
uint32 off;
- int idx;
ItemPointerSetOffsetNumber(&tid, offsets[i]);
/* encode the tid to key and val */
key = tid_to_key_off(ts, &tid, &off);
- idx = key - key_base;
- Assert(idx >= 0 && idx < nkeys);
+ /* make sure we scanned the line pointer array in order */
+ Assert(key >= prev_key);
- values[idx] |= UINT64CONST(1) << off;
+ if (key > prev_key)
+ {
+ idx = prev_key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ /* write out offset bitmap for this key */
+ values[idx] = off_bitmap;
+
+ /* zero out any gaps up to the current key */
+ for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
+ values[empty_idx] = 0;
+
+ /* reset for current key -- the current offset will be handled below */
+ off_bitmap = 0;
+ prev_key = key;
+ }
+
+ off_bitmap |= UINT64CONST(1) << off;
}
+ /* save the final index for later */
+ idx = key - key_base;
+ /* write out last offset bitmap */
+ values[idx] = off_bitmap;
+
if (TidStoreIsShared(ts))
LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
/* insert the calculated key-values to the tree */
- for (int i = 0; i < nkeys; i++)
+ for (int i = 0; i <= idx; i++)
{
if (values[i])
{
- uint64 key = key_base + i;
+ key = key_base + i;
if (TidStoreIsShared(ts))
shared_rt_set(ts->tree.shared, key, &values[i]);
--
2.39.1
v2699-0003-Force-all-callers-to-encode-no-matter-how-smal.patch.txt
From 82c1f639aaa64cc943af3b53294a63d5d8f7a9b9 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sat, 11 Feb 2023 11:51:32 +0700
Subject: [PATCH v2699 3/3] Force all callers to encode, no matter how small
the expected offset
---
src/backend/access/common/tidstore.c | 36 +++++-----------------------
1 file changed, 6 insertions(+), 30 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 3d384cf645..ff8e66936e 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -99,7 +99,6 @@ typedef struct TidStoreControl
size_t max_bytes; /* the maximum bytes a TidStore can use */
int max_offset; /* the maximum offset number */
int offset_nbits; /* the number of bits required for max_offset */
- bool encode_tids; /* do we use tid encoding? */
int offset_key_nbits; /* the number of bits of a offset number
* used for the key */
@@ -224,21 +223,15 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
ts->control->max_offset = max_offset;
ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+ if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
+ ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+
/*
* We use tid encoding if the number of bits for the offset number doesn't
* fix in a value, uint64.
*/
- if (ts->control->offset_nbits > TIDSTORE_VALUE_NBITS)
- {
- ts->control->encode_tids = true;
- ts->control->offset_key_nbits =
- ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
- }
- else
- {
- ts->control->encode_tids = false;
- ts->control->offset_key_nbits = 0;
- }
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
return ts;
}
@@ -643,12 +636,6 @@ tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
uint64 tid_i;
OffsetNumber off;
- if (i > iter->ts->control->max_offset)
- {
- Assert(!iter->ts->control->encode_tids);
- break;
- }
-
if ((val & (UINT64CONST(1) << i)) == 0)
continue;
@@ -668,10 +655,7 @@ tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
static inline BlockNumber
key_get_blkno(TidStore *ts, uint64 key)
{
- if (ts->control->encode_tids)
- return (BlockNumber) (key >> ts->control->offset_key_nbits);
-
- return (BlockNumber) key;
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
}
/* Encode a tid to key and offset */
@@ -691,14 +675,6 @@ encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off)
uint64 key;
uint64 tid_i;
- if (!ts->control->encode_tids)
- {
- *off = offset;
-
- /* Use the block number as the key */
- return (int64) block;
- }
-
tid_i = offset | ((uint64) block << ts->control->offset_nbits);
*off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
--
2.39.1
On Sat, Feb 11, 2023 at 2:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I didn't get any closer to radix-tree regression,
Me neither. It seems that in v26, inserting chunks into node-32 is
slow but needs more analysis. I'll share if I found something
interesting.
but I did find some inefficiencies in tidstore_add_tids() that are worth talking about first, addressed in a rough fashion in the attached .txt addendums that I can clean up and incorporate later.
To start, I can reproduce the regression with this test as well:
select * from bench_tidstore_load(0, 10 * 1000 * 1000);
v15 + v26 store + adjustments:
mem_allocated | load_ms
---------------+---------
98202152 | 1676

v26 0001-0008
mem_allocated | load_ms
---------------+---------
98202032 | 1826

...and reverting to the alternate way to update the parent didn't help:
v26 0001-6, 0008, insert_inner w/ null parent
mem_allocated | load_ms
---------------+---------
98202032 | 1825

...and I'm kind of glad that wasn't the problem, because going back to that would be a pain for the shmem case.
Running perf doesn't show anything much different in the proportions (note that rt_set must have been inlined when declared locally in v26):
v15 + v26 store + adjustments:
65.88% postgres postgres [.] tidstore_add_tids
10.74% postgres postgres [.] rt_set
9.20% postgres postgres [.] palloc0
6.49% postgres postgres [.] rt_node_insert_leaf

v26 0001-0008
78.50% postgres postgres [.] tidstore_add_tids
8.88% postgres postgres [.] palloc0
6.24% postgres postgres [.] local_rt_node_insert_leaf

v2699-0001: The first thing I noticed is that palloc0 is taking way more time than it should, and it's because the compiler doesn't know the values[] array is small. One reason we need to zero the array is to make the algorithm agnostic about what order the offsets come in, as I requested in a previous review. Thinking some more, I was way too paranoid about that. As long as access methods scan the line pointer array in the usual way, maybe we can just assert that the keys we create are in order, and zero any unused array entries as we find them. (I admit I can't actually think of a reason we would ever encounter offsets out of order.)
I can think of one case: traversing a HOT chain could visit offsets out of
order. But fortunately, in the heap case, we prune such collected TIDs
before the heap vacuum.
Also, we can keep track of the last key we need to consider for insertion into the radix tree, and ignore the rest. That might shave a few cycles during the exclusive lock when the max offset of an LP_DEAD item < 64 on a given page, which I think would be common in the wild. I also got rid of the special case for non-encoding, since shifting by zero should work the same way. These together led to a nice speedup on the v26 branch:
mem_allocated | load_ms
---------------+---------
98202032 | 1386

v2699-0002: The next thing I noticed is forming a full ItemPointer to pass to tid_to_key_off(). That's bad for tidstore_add_tids() because ItemPointerSetBlockNumber() must do this in order to allow the struct to be SHORTALIGN'd:
static inline void
BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
{
blockId->bi_hi = blockNumber >> 16;
blockId->bi_lo = blockNumber & 0xffff;
}

Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must reverse the above process:
static inline BlockNumber
BlockIdGetBlockNumber(const BlockIdData *blockId)
{
return (((BlockNumber) blockId->bi_hi) << 16) | ((BlockNumber) blockId->bi_lo);
}

There is no reason to do any of this if we're not reading/writing directly to/from an on-disk tid etc. To avoid this, I created a new function encode_key_off() [name could be better], which deals with the raw block number that we already have. Then turn tid_to_key_off() into a wrapper around that, since we still need the full conversion for tidstore_lookup_tid().
v2699-0003: Get rid of all the remaining special cases for encoding or not. I am unaware of the need to optimize that case or treat it in any way differently. I haven't tested this on an installation with non-default blocksize and didn't measure this separately, but 0002+0003 gives:
mem_allocated | load_ms
---------------+---------
98202032 | 1259

If these are acceptable, I can incorporate them into a later patchset.
These are nice improvements! I agree with all changes.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Feb 13, 2023 at 2:51 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Sat, Feb 11, 2023 at 2:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I didn't get any closer to radix-tree regression,
Me neither. It seems that in v26, inserting chunks into node-32 is
slow but needs more analysis. I'll share if I found something
interesting.
If that were the case, then the other benchmarks I ran would likely have
slowed down as well, but they are the same or faster. There is one
microbenchmark I didn't run before: "select * from
bench_fixed_height_search(15)" (15 to reduce noise from growing size class,
and despite the name it measures load time as well). Trying this now shows
no difference: a few runs range 19 to 21ms in each version. That also
reinforces that update_inner is fine and that the move to value pointer API
didn't regress.
Changing TIDS_PER_BLOCK_FOR_LOAD to 1 to stress the tree more gives (min of
5, perf run separate from measurements):
v15 + v26 store:
mem_allocated | load_ms
---------------+---------
98202152 | 553
19.71% postgres postgres [.] tidstore_add_tids
+ 31.47% postgres postgres [.] rt_set
= 51.18%
20.62% postgres postgres [.] rt_node_insert_leaf
6.05% postgres postgres [.] AllocSetAlloc
4.74% postgres postgres [.] AllocSetFree
4.62% postgres postgres [.] palloc
2.23% postgres postgres [.] SlabAlloc
v26:
mem_allocated | load_ms
---------------+---------
98202032 | 617
57.45% postgres postgres [.] tidstore_add_tids
20.67% postgres postgres [.] local_rt_node_insert_leaf
5.99% postgres postgres [.] AllocSetAlloc
3.55% postgres postgres [.] palloc
3.05% postgres postgres [.] AllocSetFree
2.05% postgres postgres [.] SlabAlloc
So it seems the store itself got faster when we removed shared memory paths
from the v26 store to test it against v15.
I thought to favor the local memory case in the tidstore by controlling
inlining -- it's smaller and will be called much more often, so I tried the
following (done in 0007)
#define RT_PREFIX shared_rt
#define RT_SHMEM
-#define RT_SCOPE static
+#define RT_SCOPE static pg_noinline
That brings it down to
mem_allocated | load_ms
---------------+---------
98202032 | 590
That's better, but still not within noise level. Perhaps some slowdown
is unavoidable, but it would be nice to understand why.
I can think of one case: traversing a HOT chain could visit offsets out of
order. But fortunately, in the heap case, we prune such collected TIDs
before the heap vacuum.
Further, currently we *already* assume we populate the tid array in order
(for binary search), so we can just continue assuming that (with an assert
added since it's more public in this form). I'm not sure why such basic
common sense evaded me a few versions ago...
If these are acceptable, I can incorporate them into a later patchset.
These are nice improvements! I agree with all changes.
Great, I've squashed these into the tidstore patch (0004). Also added 0005,
which is just a simplification.
I squashed the earlier dead code removal into the radix tree patch.
v27-0008 measures tid store iteration performance and adds a stub function
to prevent spurious warnings, so the benchmarking module can always be
built.
Getting the list of offsets from the old array for a given block is always
trivial, but tidstore_iter_extract_tids() is doing a huge amount of
unnecessary work when TIDS_PER_BLOCK_FOR_LOAD is 1, enough to exceed the
load time:
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 589 | 915
Fortunately, it's an easy fix, done in 0009.
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 589 | 153
I'll soon resume more cosmetic review of the tid store, but this is enough
to post.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v27-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
From d577ef9d9755e7ca4d3722c1a044381a81d66244 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v27 1/9] Introduce helper SIMD functions for small byte arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..350e2caaea 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+ * Note: There is a faster way to do this, but it returns a uint64 and
+ * and if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.1
v27-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 149a49f51f7a16b7c1eb762e704f1ec476ecb65a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v27 2/9] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 3d2225e1ae..5f9a511b4a 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 36d1dc0117..a0c60feade 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3669,7 +3669,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.1
v27-0005-Do-bitmap-conversion-in-one-place-rather-than-fo.patch
From dba9497b5b587da873fbb2de89570ec8b36d604b Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 12 Feb 2023 15:17:40 +0700
Subject: [PATCH v27 5/9] Do bitmap conversion in one place rather than forcing
callers to do it
---
src/backend/access/common/tidstore.c | 31 +++++++++++++++-------------
1 file changed, 17 insertions(+), 14 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index ff8e66936e..ad8c0866e2 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -70,6 +70,7 @@
* and value, respectively.
*/
#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
/* A magic value used to identify our TidStores. */
#define TIDSTORE_MAGIC 0x826f6a10
@@ -158,8 +159,8 @@ typedef struct TidStoreIter
static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
-static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off);
-static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
/*
* Create a TidStore. The returned object is allocated in backend-local memory.
@@ -376,10 +377,10 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
for (int i = 0; i < num_offsets; i++)
{
- uint32 off;
+ uint64 off_bit;
/* encode the tid to key and val */
- key = encode_key_off(ts, blkno, offsets[i], &off);
+ key = encode_key_off(ts, blkno, offsets[i], &off_bit);
/* make sure we scanned the line pointer array in order */
Assert(key >= prev_key);
@@ -401,7 +402,7 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
prev_key = key;
}
- off_bitmap |= UINT64CONST(1) << off;
+ off_bitmap |= off_bit;
}
/* save the final index for later */
@@ -441,10 +442,10 @@ tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
{
uint64 key;
uint64 val = 0;
- uint32 off;
+ uint64 off_bit;
bool found;
- key = tid_to_key_off(ts, tid, &off);
+ key = tid_to_key_off(ts, tid, &off_bit);
if (TidStoreIsShared(ts))
found = shared_rt_search(ts->tree.shared, key, &val);
@@ -454,7 +455,7 @@ tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
if (!found)
return false;
- return (val & (UINT64CONST(1) << off)) != 0;
+ return (val & off_bit) != 0;
}
/*
@@ -660,26 +661,28 @@ key_get_blkno(TidStore *ts, uint64 key)
/* Encode a tid to key and offset */
static inline uint64
-tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
{
uint32 offset = ItemPointerGetOffsetNumber(tid);
BlockNumber block = ItemPointerGetBlockNumber(tid);
- return encode_key_off(ts, block, offset, off);
+ return encode_key_off(ts, block, offset, off_bit);
}
/* encode a block and offset to a key and partial offset */
static inline uint64
-encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off)
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
{
uint64 key;
uint64 tid_i;
+ uint32 off_lower;
- tid_i = offset | ((uint64) block << ts->control->offset_nbits);
+ off_lower = offset & TIDSTORE_OFFSET_MASK;
+ Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
- *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ *off_bit = UINT64CONST(1) << off_lower;
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
key = tid_i >> TIDSTORE_VALUE_NBITS;
- Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
return key;
}
--
2.39.1
Attachment: v27-0003-Add-radixtree-template.patch (text/x-patch)
From bf9d659187537b250683af321b0167d69c7fb18a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v27 3/9] Add radixtree template
WIP: commit message based on template comments
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2516 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 328 +++
src/include/lib/radixtree_iter_impl.h | 153 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 674 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4082 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..1cdb995e54
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2516 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves".
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined only if RT_USE_DELETE is defined
+ *
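+ * Example (illustrative only; the "item_ts" prefix and the uint64 value type
+ * are hypothetical, chosen just for this sketch):
+ *
+ *     #define RT_PREFIX item_ts
+ *     #define RT_SCOPE static
+ *     #define RT_DECLARE
+ *     #define RT_DEFINE
+ *     #define RT_VALUE_TYPE uint64
+ *     #include "lib/radixtree.h"
+ *
+ * generates a local-memory radix tree with functions such as item_ts_create(),
+ * item_ts_set(), item_ts_search(), item_ts_begin_iterate(),
+ * item_ts_iterate_next(), item_ts_end_iterate() and item_ts_free(), e.g.:
+ *
+ *     item_ts_radix_tree *tree;
+ *     uint64 key = 123;
+ *     uint64 val = 42;
+ *
+ *     tree = item_ts_create(CurrentMemoryContext);
+ *     item_ts_set(tree, key, &val);
+ *     if (item_ts_search(tree, key, &val))
+ *         elog(DEBUG1, "found " UINT64_FORMAT, val);
+ *     item_ts_free(tree);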
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* The maximum number of levels in the radix tree */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
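+
+/*
+ * Illustrative example (not part of the patch): with RT_NODE_SPAN == 8, the
+ * key 0x070503 is split into one chunk per level:
+ *   RT_GET_KEY_CHUNK(0x070503, 16) == 0x07
+ *   RT_GET_KEY_CHUNK(0x070503, 8)  == 0x05
+ *   RT_GET_KEY_CHUNK(0x070503, 0)  == 0x03
+ */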
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256, is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
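+
+/*
+ * Illustrative example (assuming the default 8kB slab block size): a 40-byte
+ * node gets a block of (8192 / 40) * 40 = 8160 bytes (204 nodes per block),
+ * while a hypothetical 2080-byte node would get 2080 * 32 = 66560 bytes, so
+ * at least 32 nodes always fit in a block.
+ */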
+
+/* Common type for all node types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store the
+ * pointer to the child node in the slot. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base type for each node kind, shared by leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses a slot_idxs array, of RT_NODE_MAX_SLOTS length,
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* For each chunk, the index into the children/values array */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different than
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over nodes at each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning an iteration
+ * while one process is iterating, or to allow multiple processes to iterate concurrently.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
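+
+/*
+ * Illustrative usage sketch (hypothetical "item_ts" prefix and uint64 value
+ * type, as in the example near the top of this file):
+ *
+ *     uint64 key;
+ *     uint64 val;
+ *     item_ts_iter *iter = item_ts_begin_iterate(tree);
+ *
+ *     while (item_ts_iterate_next(iter, &key, &val))
+ *         ;    // do something with (key, val)
+ *     item_ts_end_iterate(iter);
+ *
+ * Keys are returned in ascending order; the shared lock is held until
+ * item_ts_end_iterate() is called.
+ */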
+
+
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Clear the slot at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that will allow storing the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
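+
+/*
+ * Illustrative example (not part of the patch): for key 0x1FFFF the highest
+ * set bit is bit 16, so RT_KEY_GET_SHIFT() returns (16 / 8) * 8 = 16, and a
+ * root with shift 16 covers chunks at shifts 16, 8 and 0;
+ * RT_SHIFT_GET_MAX_VAL(16) is then 0xFFFFFF.
+ */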
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static pg_noinline void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static inline void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
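+ *
+ * Illustrative example (not part of the patch): if the current root has
+ * shift 0 (max_val 0xFF) and we need to store key 0x1234, one new root
+ * node at shift 8 is added and max_val becomes 0xFFFF.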
+ */
+static pg_noinline void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' to bottom.
+ */
+static pg_noinline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the value
+ * is stored in *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static void
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set the value for the given key. If the entry already exists, we update its
+ * value and return true. Returns false if the entry doesn't yet exist.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is copied into *value_p, so it
+ * must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists, otherwise
+ * NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the value
+ * in *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ {
+ MemoryContextSwitchTo(old_ctx);
+ return iter;
+ }
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is constructed
+ * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key; otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance the inner node
+ * iterators from level 1 until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Found the next child node. Update the iterator stack from this node
+ * down to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function must be called when the iteration is finished, or when
+ * exiting it early, in order to release the lock.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check that the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s",buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
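+ *
+ * The included code returns true if the key's chunk was found and deleted
+ * from the node, otherwise false.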
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..d56e58dcac
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,328 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
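+ *
+ * For leaf nodes, the included code returns whether the key already existed
+ * (in which case its value is simply replaced); for inner nodes it returns
+ * nothing.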
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ bool chunk_exists = false;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
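+ /*
+ * Each case below may grow the node into the next larger kind when it is
+ * full; in that case the new node replaces the old one in the parent and
+ * we fall through to the next case to perform the insertion into the
+ * grown node.
+ */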
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n3->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos;
+ int cnt = 0;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n125->values[slotpos] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!chunk_exists)
+ node->count++;
+#else
+ node->count++;
+#endif
+
+ /*
+ * Done. Finally, verify that the chunk and value have been inserted or
+ * replaced properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return chunk_exists;
+#else
+ return;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..98c78eb237
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,153 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
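+ *
+ * For inner nodes, the included code returns the next child node (or NULL
+ * if there is none); for leaf nodes it returns whether a next value was
+ * found, setting *value_p if so.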
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
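+ *
+ * Without RT_ACTION_UPDATE, the included code returns true and sets *value_p
+ * (leaf) or *child_p (inner) if the key's chunk is present, otherwise false.
+ * With RT_ACTION_UPDATE, it replaces the existing child with new_child and
+ * returns nothing.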
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
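+
+For example (assuming a standard in-tree make build), the regression test can
+be run with:
+
+    make -C src/test/modules/test_radixtree check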
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
'--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..f944945db9
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,674 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
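+/*
+ * Fanouts of each node kind, with a leading zero so that the previous entry
+ * can serve as the lower bound of the key range checked after a node grows.
+ */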
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in an order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType*) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.1
Attachment: v27-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
From 1be520c83274bc3a2f068689e665c254c8e3c04e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v27 4/9] Add TIDStore, to store sets of TIDs (ItemPointerData)
efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 685 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 195 +++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1030 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b246ddc634..e44387d2c1 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2192,6 +2192,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..ff8e66936e
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,685 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of a 64-bit key and a 64-bit value,
+ * and stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * by tidstore_attach().
+ *
+ * Regarding concurrency, we basically rely on the concurrency support in the
+ * radix tree, but we acquire the lock on a TidStore in some cases, for
+ * example, when resetting the store and when accessing the number of tids in
+ * the store (num_tids).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of a 64-bit key and
+ * a 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is determined by max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys could help keep the radix
+ * tree shallow.
+ *
+ * For example, a heap tid with 8kB blocks uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ *
+ * If the bitmap of all possible offset numbers for a block fits in a single
+ * 64-bit value (i.e. offset_nbits <= TIDSTORE_VALUE_NBITS), we don't encode
+ * tids; the block number is used directly as the key and the offset bitmap
+ * as the value.
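+ *
+ * As a concrete illustration (assuming 8kB heap blocks, so offset_nbits = 9
+ * and offset_key_nbits = 3), the tid (block 1000000, offset 5) is encoded as
+ *
+ *   tid_i = (1000000 << 9) | 5 = 512000005
+ *   key   = tid_i >> 6        = 8000000
+ *   value = UINT64CONST(1) << (tid_i & 63) = UINT64CONST(1) << 5
+ *
+ * and decoding reverses it: block = key >> 3 = 1000000, and each set bit i
+ * in the value yields offset (((key << 6) | i) & ((1 << 9) - 1)) = 5.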
+ */
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for max_offset */
+ int offset_key_nbits; /* the number of bits of a offset number
+ * used for the key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends on the number of stored tids, but also on their
+ * distribution, how the radix tree stores them, and the memory management
+ * that backs the radix tree. The maximum number of bytes that a TidStore can
+ * use is specified by max_bytes in tidstore_create(). We want the total
+ * memory consumption of a TidStore not to exceed max_bytes.
+ *
+ * In the local TidStore case, the radix tree uses a slab allocator for each
+ * kind of node class. The most memory-consuming case while adding tids
+ * associated with one page (i.e. during tidstore_add_tids()) is allocating a
+ * new slab block for a new radix tree node, which is approximately 70kB.
+ * Therefore, we deduct 70kB from max_bytes.
+ *
+ * In the shared case, DSA allocates memory segments big enough to follow a
+ * geometric series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases segment size,
+ * and the simulation revealed that a 75% threshold for the maximum bytes
+ * works perfectly when max_bytes is a power of two, and a 60% threshold
+ * works for other cases.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
+ ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+
+ /*
+ * We use tid encoding if the bitmap of all possible offset numbers for a
+ * block doesn't fit in a single 64-bit value.
+ */
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected tids. This is similar to tidstore_destroy, but instead
+ * of freeing the entire TidStore, it recreates only the radix tree storage.
+ */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/* Add Tids on a block to TidStore */
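+/*
+ * As an illustrative example (assuming 8kB heap blocks, so offset_key_nbits
+ * is 3 and key_base = blkno << 3): for block 2 with offsets {1, 5, 100},
+ * key_base is 16; offsets 1 and 5 set bits 1 and 5 in values[0] (key 16),
+ * and offset 100 sets bit 100 - 64 = 36 in values[1] (key 17). Only non-zero
+ * bitmaps are inserted into the radix tree below.
+ */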
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 *values;
+ uint64 key;
+ uint64 prev_key;
+ uint64 off_bitmap = 0;
+ int idx;
+ const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ values = palloc(sizeof(uint64) * nkeys);
+ key = prev_key = key_base;
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ /* encode the tid to key and val */
+ key = encode_key_off(ts, blkno, offsets[i], &off);
+
+ /* make sure we scanned the line pointer array in order */
+ Assert(key >= prev_key);
+
+ if (key > prev_key)
+ {
+ idx = prev_key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ /* write out offset bitmap for this key */
+ values[idx] = off_bitmap;
+
+ /* zero out any gaps up to the current key */
+ for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
+ values[empty_idx] = 0;
+
+ /* reset for current key -- the current offset will be handled below */
+ off_bitmap = 0;
+ prev_key = key;
+ }
+
+ off_bitmap |= UINT64CONST(1) << off;
+ }
+
+ /* save the final index for later */
+ idx = key - key_base;
+ /* write out last offset bitmap */
+ values[idx] = off_bitmap;
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i <= idx; i++)
+ {
+ if (values[i])
+ {
+ key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &values[i]);
+ else
+ local_rt_set(ts->tree.local, key, &values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration are blocked when inserting a
+ * key-value pair into the radix tree.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a pointer to a TidStoreIterResult that has the
+ * tids in one block. We return block numbers in ascending order, and the
+ * offset numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ /* Process the previously collected key-value */
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/*
+ * Finish an iteration over TidStore. This needs to be called after finishing
+ * the iteration, or when exiting an iteration early.
+ */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+{
+ uint32 offset = ItemPointerGetOffsetNumber(tid);
+ BlockNumber block = ItemPointerGetBlockNumber(tid);
+
+ return encode_key_off(ts, block, offset, off);
+}
+
+/* encode a block and offset to a key and partial offset */
+static inline uint64
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off)
+{
+ uint64 key;
+ uint64 tid_i;
+
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
+
+ *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9b849ae8e8
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,195 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.1
Attachment: v27-0006-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch
From b0515a40b3aa4709047c7b70b9c0cadded979d15 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v27 6/9] Tool for measuring radix tree and tidstore
performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 87 +++
contrib/bench_radix_tree/bench_radix_tree.c | 717 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 894 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..fbf51c1086
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,87 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_tidstore_load(
+minblk int4,
+maxblk int4,
+OUT mem_allocated int8,
+OUT load_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
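+
+-- Illustrative usage (a sketch, not part of the regression test): load tids
+-- for blocks 0..999999 into a TidStore and report memory usage and load time,
+-- then do the same for the radix tree and measure sequential lookups.
+--
+--   SELECT * FROM bench_tidstore_load(0, 1000000);
+--   SELECT * FROM bench_seq_search(0, 1000000);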
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..b5ad75364c
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,717 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+//#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+PG_FUNCTION_INFO_V1(bench_tidstore_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates shuffle implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* for reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+Datum
+bench_tidstore_load(PG_FUNCTION_ARGS)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ TidStore *ts;
+ OffsetNumber *offs;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_ms;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2] = {false};
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ offs = palloc(sizeof(OffsetNumber) * TIDS_PER_BLOCK_FOR_LOAD);
+ for (int i = 0; i < TIDS_PER_BLOCK_FOR_LOAD; i++)
+ offs[i] = i + 1; /* FirstOffsetNumber is 1 */
+
+ ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* load tids */
+ start_time = GetCurrentTimestamp();
+ for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
+ tidstore_add_tids(ts, blkno, offs, TIDS_PER_BLOCK_FOR_LOAD);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_ms = secs * 1000 + usecs / 1000;
+
+ values[0] = Int64GetDatum(tidstore_memory_usage(ts));
+ values[1] = Int64GetDatum(load_ms);
+
+ tidstore_destroy(ts);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ int64 search_time_ms;
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.1
Attachment: v27-0008-Measure-iteration-of-tidstore.patch (text/x-patch; charset=US-ASCII)
From 72bb462b1dab005cbc2aff265baedbaaee62cb2b Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 17:02:53 +0700
Subject: [PATCH v27 8/9] Measure iteration of tidstore
---
.../bench_radix_tree--1.0.sql | 3 +-
contrib/bench_radix_tree/bench_radix_tree.c | 40 ++++++++++++++++---
contrib/meson.build | 2 +-
3 files changed, 38 insertions(+), 7 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index fbf51c1086..ad66265e23 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -80,7 +80,8 @@ create function bench_tidstore_load(
minblk int4,
maxblk int4,
OUT mem_allocated int8,
-OUT load_ms int8
+OUT load_ms int8,
+OUT iter_ms int8
)
returns record
as 'MODULE_PATHNAME'
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index b5ad75364c..6e5149e2c4 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -176,15 +176,18 @@ bench_tidstore_load(PG_FUNCTION_ARGS)
BlockNumber minblk = PG_GETARG_INT32(0);
BlockNumber maxblk = PG_GETARG_INT32(1);
TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
OffsetNumber *offs;
TimestampTz start_time,
end_time;
long secs;
int usecs;
int64 load_ms;
+ int64 iter_ms;
TupleDesc tupdesc;
- Datum values[2];
- bool nulls[2] = {false};
+ Datum values[3];
+ bool nulls[3] = {false};
/* Build a tuple descriptor for our result type */
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
@@ -196,9 +199,6 @@ bench_tidstore_load(PG_FUNCTION_ARGS)
ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
- elog(NOTICE, "sleeping for 2 seconds...");
- pg_usleep(2 * 1000000L);
-
/* load tids */
start_time = GetCurrentTimestamp();
for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
@@ -207,8 +207,22 @@ bench_tidstore_load(PG_FUNCTION_ARGS)
TimestampDifference(start_time, end_time, &secs, &usecs);
load_ms = secs * 1000 + usecs / 1000;
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* iterate through tids */
+ iter = tidstore_begin_iterate(ts);
+ start_time = GetCurrentTimestamp();
+ while ((result = tidstore_iterate_next(iter)) != NULL)
+ ;
+ tidstore_end_iterate(iter);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ iter_ms = secs * 1000 + usecs / 1000;
+
values[0] = Int64GetDatum(tidstore_memory_usage(ts));
values[1] = Int64GetDatum(load_ms);
+ values[2] = Int64GetDatum(iter_ms);
tidstore_destroy(ts);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
@@ -715,3 +729,19 @@ bench_node128_load(PG_FUNCTION_ARGS)
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}
+
+/* to silence warnings about unused iter functions */
+static void pg_attribute_unused()
+stub_iter()
+{
+ rt_radix_tree *rt;
+ rt_iter *iter;
+ uint64 key = 1;
+ uint64 value = 1;
+
+ rt = rt_create(CurrentMemoryContext);
+
+ iter = rt_begin_iterate(rt);
+ rt_iterate_next(iter, &key, &value);
+ rt_end_iterate(iter);
+}
\ No newline at end of file
diff --git a/contrib/meson.build b/contrib/meson.build
index 52253de793..421d469f8c 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,7 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
-#subdir('bench_radix_tree')
+subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.1
Attachment: v27-0007-Prevent-inlining-of-interface-functions-for-shme.patch (text/x-patch; charset=US-ASCII)
From 54ab02eb2188382185436059ff6e7ad95d970c5d Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 17:00:31 +0700
Subject: [PATCH v27 7/9] Prevent inlining of interface functions for shmem
---
src/backend/access/common/tidstore.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index ad8c0866e2..d1b4675ea4 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -84,7 +84,7 @@
#define RT_PREFIX shared_rt
#define RT_SHMEM
-#define RT_SCOPE static
+#define RT_SCOPE static pg_noinline
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE uint64
--
2.39.1
Attachment: v27-0009-Speed-up-tidstore_iter_extract_tids.patch (text/x-patch; charset=US-ASCII)
From 8ccc66211973bcc44a6bad45c05302ca743c1489 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 17:53:37 +0700
Subject: [PATCH v27 9/9] Speed up tidstore_iter_extract_tids()
---
src/backend/access/common/tidstore.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index d1b4675ea4..5a897c01f7 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -632,21 +632,21 @@ tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
{
TidStoreIterResult *result = (&iter->result);
- for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ while (val)
{
uint64 tid_i;
OffsetNumber off;
- if ((val & (UINT64CONST(1) << i)) == 0)
- continue;
-
tid_i = key << TIDSTORE_VALUE_NBITS;
- tid_i |= i;
+ tid_i |= pg_rightmost_one_pos64(val);
off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
Assert(result->num_offsets < iter->ts->control->max_offset);
result->offsets[result->num_offsets++] = off;
+
+ /* unset the rightmost bit */
+ val &= ~pg_rightmost_one64(val);
}
result->blkno = key_get_blkno(iter->ts, key);
--
2.39.1
The benchmark module shouldn't have been un-commented-out, so I've attached a
revert of that.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
Attachment: v28-0008-Measure-iteration-of-tidstore.patch (text/x-patch; charset=US-ASCII)
From 72bb462b1dab005cbc2aff265baedbaaee62cb2b Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 17:02:53 +0700
Subject: [PATCH v28 08/10] Measure iteration of tidstore
---
.../bench_radix_tree--1.0.sql | 3 +-
contrib/bench_radix_tree/bench_radix_tree.c | 40 ++++++++++++++++---
contrib/meson.build | 2 +-
3 files changed, 38 insertions(+), 7 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index fbf51c1086..ad66265e23 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -80,7 +80,8 @@ create function bench_tidstore_load(
minblk int4,
maxblk int4,
OUT mem_allocated int8,
-OUT load_ms int8
+OUT load_ms int8,
+OUT iter_ms int8
)
returns record
as 'MODULE_PATHNAME'
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index b5ad75364c..6e5149e2c4 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -176,15 +176,18 @@ bench_tidstore_load(PG_FUNCTION_ARGS)
BlockNumber minblk = PG_GETARG_INT32(0);
BlockNumber maxblk = PG_GETARG_INT32(1);
TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
OffsetNumber *offs;
TimestampTz start_time,
end_time;
long secs;
int usecs;
int64 load_ms;
+ int64 iter_ms;
TupleDesc tupdesc;
- Datum values[2];
- bool nulls[2] = {false};
+ Datum values[3];
+ bool nulls[3] = {false};
/* Build a tuple descriptor for our result type */
if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
@@ -196,9 +199,6 @@ bench_tidstore_load(PG_FUNCTION_ARGS)
ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
- elog(NOTICE, "sleeping for 2 seconds...");
- pg_usleep(2 * 1000000L);
-
/* load tids */
start_time = GetCurrentTimestamp();
for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
@@ -207,8 +207,22 @@ bench_tidstore_load(PG_FUNCTION_ARGS)
TimestampDifference(start_time, end_time, &secs, &usecs);
load_ms = secs * 1000 + usecs / 1000;
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* iterate through tids */
+ iter = tidstore_begin_iterate(ts);
+ start_time = GetCurrentTimestamp();
+ while ((result = tidstore_iterate_next(iter)) != NULL)
+ ;
+ tidstore_end_iterate(iter);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ iter_ms = secs * 1000 + usecs / 1000;
+
values[0] = Int64GetDatum(tidstore_memory_usage(ts));
values[1] = Int64GetDatum(load_ms);
+ values[2] = Int64GetDatum(iter_ms);
tidstore_destroy(ts);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
@@ -715,3 +729,19 @@ bench_node128_load(PG_FUNCTION_ARGS)
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
}
+
+/* to silence warnings about unused iter functions */
+static void pg_attribute_unused()
+stub_iter()
+{
+ rt_radix_tree *rt;
+ rt_iter *iter;
+ uint64 key = 1;
+ uint64 value = 1;
+
+ rt = rt_create(CurrentMemoryContext);
+
+ iter = rt_begin_iterate(rt);
+ rt_iterate_next(iter, &key, &value);
+ rt_end_iterate(iter);
+}
\ No newline at end of file
diff --git a/contrib/meson.build b/contrib/meson.build
index 52253de793..421d469f8c 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,7 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
-#subdir('bench_radix_tree')
+subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.1
Attachment: v28-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch (text/x-patch; charset=US-ASCII)
From 149a49f51f7a16b7c1eb762e704f1ec476ecb65a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v28 02/10] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 3d2225e1ae..5f9a511b4a 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 36d1dc0117..a0c60feade 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3669,7 +3669,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.39.1
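
An aside on how these helpers combine: pg_rightmost_one64() isolates the lowest
set bit and pg_rightmost_one_pos64() gives its position, so a caller can walk the
set bits of a word without scanning all 64 positions, which is exactly what the
tidstore iteration speedup later in this series does. A minimal sketch of the
pattern; the function name and the elog() call are illustrative only, not part of
the patch:

static void
walk_set_bits(uint64 val)
{
	while (val)
	{
		/* position of the lowest set bit still present */
		int			pos = pg_rightmost_one_pos64(val);

		elog(DEBUG1, "bit %d is set", pos);

		/* clear the bit we just handled */
		val &= ~pg_rightmost_one64(val);
	}
}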
Attachment: v28-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch (text/x-patch; charset=US-ASCII)
From d577ef9d9755e7ca4d3722c1a044381a81d66244 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v28 01/10] Introduce helper SIMD functions for small byte
arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index c836360d4b..350e2caaea 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+ * Note: There is a faster way to do this, but it returns a uint64, and
+ * if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.39.1
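
For context on how these two helpers are meant to be used together in the radix
tree node search: vector8_min() emulates a ">=" comparison (SSE2 has no unsigned
byte compare), and vector8_highbit_mask() turns the comparison result into a
bitmask whose lowest set bit gives the matching index. A rough sketch, assuming a
sorted 16-byte chunk array; the function name and array size are illustrative
only:

/* index of the first element >= key in a sorted 16-byte array, or 16 if none */
static int
chunk_array_lower_bound(const uint8 *chunks, uint8 key)
{
#ifndef USE_NO_SIMD
	Vector8		haystack;
	Vector8		spread = vector8_broadcast(key);
	uint32		bitfield;

	vector8_load(&haystack, chunks);

	/* min(key, chunk) == key exactly where chunk >= key */
	bitfield = vector8_highbit_mask(vector8_eq(vector8_min(spread, haystack), spread));

	if (bitfield)
		return pg_rightmost_one_pos32(bitfield);
	return 16;
#else
	for (int i = 0; i < 16; i++)
	{
		if (chunks[i] >= key)
			return i;
	}
	return 16;
#endif
}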
Attachment: v28-0005-Do-bitmap-conversion-in-one-place-rather-than-fo.patch (text/x-patch; charset=US-ASCII)
From dba9497b5b587da873fbb2de89570ec8b36d604b Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 12 Feb 2023 15:17:40 +0700
Subject: [PATCH v28 05/10] Do bitmap conversion in one place rather than
forcing callers to do it
---
src/backend/access/common/tidstore.c | 31 +++++++++++++++-------------
1 file changed, 17 insertions(+), 14 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index ff8e66936e..ad8c0866e2 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -70,6 +70,7 @@
* and value, respectively.
*/
#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
/* A magic value used to identify our TidStores. */
#define TIDSTORE_MAGIC 0x826f6a10
@@ -158,8 +159,8 @@ typedef struct TidStoreIter
static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
-static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off);
-static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
/*
* Create a TidStore. The returned object is allocated in backend-local memory.
@@ -376,10 +377,10 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
for (int i = 0; i < num_offsets; i++)
{
- uint32 off;
+ uint64 off_bit;
/* encode the tid to key and val */
- key = encode_key_off(ts, blkno, offsets[i], &off);
+ key = encode_key_off(ts, blkno, offsets[i], &off_bit);
/* make sure we scanned the line pointer array in order */
Assert(key >= prev_key);
@@ -401,7 +402,7 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
prev_key = key;
}
- off_bitmap |= UINT64CONST(1) << off;
+ off_bitmap |= off_bit;
}
/* save the final index for later */
@@ -441,10 +442,10 @@ tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
{
uint64 key;
uint64 val = 0;
- uint32 off;
+ uint64 off_bit;
bool found;
- key = tid_to_key_off(ts, tid, &off);
+ key = tid_to_key_off(ts, tid, &off_bit);
if (TidStoreIsShared(ts))
found = shared_rt_search(ts->tree.shared, key, &val);
@@ -454,7 +455,7 @@ tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
if (!found)
return false;
- return (val & (UINT64CONST(1) << off)) != 0;
+ return (val & off_bit) != 0;
}
/*
@@ -660,26 +661,28 @@ key_get_blkno(TidStore *ts, uint64 key)
/* Encode a tid to key and offset */
static inline uint64
-tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
{
uint32 offset = ItemPointerGetOffsetNumber(tid);
BlockNumber block = ItemPointerGetBlockNumber(tid);
- return encode_key_off(ts, block, offset, off);
+ return encode_key_off(ts, block, offset, off_bit);
}
/* encode a block and offset to a key and partial offset */
static inline uint64
-encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off)
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
{
uint64 key;
uint64 tid_i;
+ uint32 off_lower;
- tid_i = offset | ((uint64) block << ts->control->offset_nbits);
+ off_lower = offset & TIDSTORE_OFFSET_MASK;
+ Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
- *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ *off_bit = UINT64CONST(1) << off_lower;
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
key = tid_i >> TIDSTORE_VALUE_NBITS;
- Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
return key;
}
--
2.39.1
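
To make the new encoding concrete, here is a worked example assuming 8kB heap
pages (offset_nbits = 9); the block and offset values are arbitrary and only for
illustration:

	BlockNumber	block = 1000;
	OffsetNumber offset = 20;

	uint64		tid_i = offset | ((uint64) block << 9);		/* 512020 */
	uint64		key = tid_i >> TIDSTORE_VALUE_NBITS;		/* 8000 */
	uint64		off_bit = UINT64CONST(1) << (offset & TIDSTORE_OFFSET_MASK);	/* bit 20 */

Offsets 1 through 63 of block 1000 all produce the same key (8000), so they are
OR'ed into a single 64-bit bitmap value; offsets 64 through 127 go to key 8001,
and so on.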
Attachment: v28-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (text/x-patch; charset=US-ASCII)
From 1be520c83274bc3a2f068689e665c254c8e3c04e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v28 04/10] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into 64-bit key and
value and inserted to the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 685 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 195 +++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1030 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b246ddc634..e44387d2c1 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2192,6 +2192,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..ff8e66936e
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,685 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing DSA area
+ * to tidstore_create(). Other backends can attach to the shared TidStore by
+ * tidstore_attach().
+ *
+ * Regarding the concurrency, it basically relies on the concurrency support in
+ * the radix tree, but we acquire the lock on a TidStore in some cases, for
+ * example, when resetting the store and when accessing the number of tids in the
+ * store (num_tids).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of 64-bit key and
+ * 64-bit value. First, we construct 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is specified by max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys could help keeping the radix
+ * tree shallow.
+ *
+ * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ *
+ * If the number of bits for offset number fits in a 64-bit value, we don't
+ * encode tids but directly use the block number and the offset number as key
+ * and value, respectively.
+ */
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for max_offset */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used for the key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* have we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends on the number of stored tids, but also on their
+ * distribution, how the radix tree stores them, and the memory management
+ * that backs the radix tree. The maximum number of bytes that a TidStore can
+ * use is specified by max_bytes in tidstore_create(). We want the total
+ * amount of memory consumption by a TidStore not to exceed the max_bytes.
+ *
+ * In local TidStore cases, the radix tree uses slab allocators for each kind
+ * of node class. The most memory consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is that we allocate a new
+ * slab block for a new radix tree node, which is approximately 70kB. Therefore,
+ * we deduct 70kB from the max_bytes.
+ *
+ * In shared cases, DSA allocates the memory segments big enough to follow
+ * a geometric series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases the segment
+ * size, and the simulation revealed that the 75% threshold for the maximum
+ * bytes works perfectly in the case where max_bytes is a power of 2, and the 60%
+ * threshold works for other cases.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
+ ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+
+ /*
+ * We use tid encoding if the number of bits for the offset number doesn't
+ * fit in a uint64 value.
+ */
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming error where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected Tids. It's similar to tidstore_destroy but we don't free
+ * entire TidStore but recreate only the radix tree storage.
+ */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 *values;
+ uint64 key;
+ uint64 prev_key;
+ uint64 off_bitmap = 0;
+ int idx;
+ const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ values = palloc(sizeof(uint64) * nkeys);
+ key = prev_key = key_base;
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint32 off;
+
+ /* encode the tid to key and val */
+ key = encode_key_off(ts, blkno, offsets[i], &off);
+
+ /* make sure we scanned the line pointer array in order */
+ Assert(key >= prev_key);
+
+ if (key > prev_key)
+ {
+ idx = prev_key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ /* write out offset bitmap for this key */
+ values[idx] = off_bitmap;
+
+ /* zero out any gaps up to the current key */
+ for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
+ values[empty_idx] = 0;
+
+ /* reset for current key -- the current offset will be handled below */
+ off_bitmap = 0;
+ prev_key = key;
+ }
+
+ off_bitmap |= UINT64CONST(1) << off;
+ }
+
+ /* save the final index for later */
+ idx = key - key_base;
+ /* write out last offset bitmap */
+ values[idx] = off_bitmap;
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i <= idx; i++)
+ {
+ if (values[i])
+ {
+ key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &values[i]);
+ else
+ local_rt_set(ts->tree.local, key, &values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint32 off;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & (UINT64CONST(1) << off)) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value to the radix tree.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to do */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a pointer to TidStoreIterResult that has tids
+ * in one block. We return the block numbers in ascending order, and the offset
+ * numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ /* Process the previously collected key-value */
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/*
+ * Finish an iteration over TidStore. This needs to be called after finishing
+ * or when exiting an iteration.
+ */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ if ((val & (UINT64CONST(1) << i)) == 0)
+ continue;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= i;
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+{
+ uint32 offset = ItemPointerGetOffsetNumber(tid);
+ BlockNumber block = ItemPointerGetBlockNumber(tid);
+
+ return encode_key_off(ts, block, offset, off);
+}
+
+/* encode a block and offset to a key and partial offset */
+static inline uint64
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off)
+{
+ uint64 key;
+ uint64 tid_i;
+
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
+
+ *off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9b849ae8e8
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,195 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %d offsets, expected %d",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.39.1
Attachment: v28-0003-Add-radixtree-template.patch (text/x-patch; charset=US-ASCII)
From bf9d659187537b250683af321b0167d69c7fb18a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v28 03/10] Add radixtree template
WIP: commit message based on template comments
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2516 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 328 +++
src/include/lib/radixtree_iter_impl.h | 153 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 674 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4082 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..1cdb995e54
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2516 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves".
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case, several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards. See the example following the parameter list below.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined, function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined, function definitions are generated
+ * - RT_SCOPE - the scope (e.g. extern, static inline) in which function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined, add stats tracking and debugging functions
+ *
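+ * For example, a local (non-shared) tree mapping uint64 keys to uint64
+ * values could be generated like this (a sketch; the prefix 'foo' and the
+ * value type are arbitrary):
+ *
+ *   #define RT_PREFIX foo
+ *   #define RT_SCOPE static
+ *   #define RT_DECLARE
+ *   #define RT_DEFINE
+ *   #define RT_VALUE_TYPE uint64
+ *   #include "lib/radixtree.h"
+ *
+ * This produces foo_radix_tree along with foo_create(), foo_set(),
+ * foo_search() and so on, per the interface described below.
+ *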
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Maximum number of levels the radix tree can have */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256, is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
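+ *
+ * For example, assuming the default 8kB slab block size (8192 bytes), a
+ * 40-byte chunk gives Max((8192 / 40) * 40, 40 * 32) = Max(8160, 1280) = 8160.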
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+/* Common type for all node types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: Inner tree nodes (shift > 0) store pointers
+ * to their child nodes in the slots. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base types of each node kind for leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses a slot_idxs array, an array of RT_NODE_MAX_SLOTS length,
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
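+ *
+ * For example (hypothetical contents): if slot_idxs[0x2A] == 5, the entry
+ * for key chunk 0x2A lives in children[5] (inner nodes) or values[5]
+ * (leaf nodes).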
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The slot index for each key chunk; RT_INVALID_SLOT_IDX if unused */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might not fit into a
+ * pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over nodes at each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
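+ * A typical iteration, in outline (a sketch only):
+ *
+ *   RT_ITER *iter = RT_BEGIN_ITERATE(tree);
+ *   uint64 key;
+ *   RT_VALUE_TYPE value;
+ *
+ *   while (RT_ITERATE_NEXT(iter, &key, &value))
+ *       ... use key and value ...
+ *   RT_END_ITERATE(iter);
+ *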
+ * XXX: Currently we allow only one process to do iteration. Therefore, RT_NODE_ITER
+ * has local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard that disallows other processes from beginning an
+ * iteration while one is in progress, or support for multiple concurrent iterations.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child or value at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that will allow storing the given key.
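+ * For example, key 0xFF needs shift 0, while key 0x10000 needs shift 16.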
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static pg_noinline void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static inline void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static pg_noinline void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' to bottom.
+ */
+static pg_noinline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is returned in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the pointer
+ * the value is copied into *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static void
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set 'key' to the value pointed to by 'value_p'. If the entry already exists,
+ * update its value and return true; otherwise insert a new entry and return false.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
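+ /* Arrived at a leaf node; insert or update the value there */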
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is set to *value_p, so it must
+ * not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* the key was not found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * If the leaf node still has keys, we don't need to delete the node itself,
+ * so we're done.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the key in inner nodes recursively */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, stop propagating the deletion */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and set the
+ * value to *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ {
+ MemoryContextSwitchTo(old_ctx);
+ return iter;
+ }
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend from the root to the leftmost leaf node. The key is constructed
+ * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key; otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance inner node
+ * iterators from level 1 upward until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node in the node iterator and update the iterator stack from
+ * this node downward.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function must be called when the iteration is finished, or when
+ * bailing out of one early.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check that the number of used chunks matches */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n32_min = %u, n32_max = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
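+ /*
+ * Look up the key's chunk in this node and remove it if present; return
+ * false if the chunk is not found. The caller is responsible for freeing
+ * the node if it becomes empty.
+ */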
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..d56e58dcac
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,328 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ bool chunk_exists = false;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
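+ /*
+ * Each case below either inserts into the node if it has room, or grows
+ * the node to the next kind or size class and falls through to insert
+ * into the new node.
+ */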
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n3->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos;
+ int cnt = 0;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n125->values[slotpos] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!chunk_exists)
+ node->count++;
+#else
+ node->count++;
+#endif
+
+ /*
+ * Done. Finally, verify that the chunk and value have been inserted or
+ * replaced properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return chunk_exists;
+#else
+ return;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..98c78eb237
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,153 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
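+ /*
+ * Advance 'current_idx' to the next used slot in this node. For the
+ * array-based kinds (3 and 32) it is an index into the chunks array;
+ * for kinds 125 and 256 it is the chunk value itself.
+ */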
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
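+ /*
+ * When RT_ACTION_UPDATE is defined, replace the existing child pointer
+ * with 'new_child' (inner nodes only) instead of returning the found
+ * child or value.
+ */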
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/include/lib/radixtree.h"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation
+in src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..f944945db9
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,674 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as a
+ * micro-benchmark of various operations and input patterns (if you do
+ * that, you might want to increase the number of values used in each
+ * test, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree returned non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_iterate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in an interleaved order like 1, children, 2, children - 1, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType*) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test inserting and deleting key-value pairs for each node type at the
+ * given shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.39.1
Attachment: v28-0006-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch (text/x-patch)
From b0515a40b3aa4709047c7b70b9c0cadded979d15 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v28 06/10] Tool for measuring radix tree and tidstore
performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 87 +++
contrib/bench_radix_tree/bench_radix_tree.c | 717 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 894 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..fbf51c1086
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,87 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_tidstore_load(
+minblk int4,
+maxblk int4,
+OUT mem_allocated int8,
+OUT load_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..b5ad75364c
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,717 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+//#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+PG_FUNCTION_INFO_V1(bench_tidstore_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+Datum
+bench_tidstore_load(PG_FUNCTION_ARGS)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ TidStore *ts;
+ OffsetNumber *offs;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_ms;
+ TupleDesc tupdesc;
+ Datum values[2];
+ bool nulls[2] = {false};
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ offs = palloc(sizeof(OffsetNumber) * TIDS_PER_BLOCK_FOR_LOAD);
+ for (int i = 0; i < TIDS_PER_BLOCK_FOR_LOAD; i++)
+ offs[i] = i + 1; /* FirstOffsetNumber is 1 */
+
+ ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* load tids */
+ start_time = GetCurrentTimestamp();
+ for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
+ tidstore_add_tids(ts, blkno, offs, TIDS_PER_BLOCK_FOR_LOAD);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_ms = secs * 1000 + usecs / 1000;
+
+ values[0] = Int64GetDatum(tidstore_memory_usage(ts));
+ values[1] = Int64GetDatum(load_ms);
+
+ tidstore_destroy(ts);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ int64 search_time_ms;
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.1
Attachment: v28-0007-Prevent-inlining-of-interface-functions-for-shme.patch (text/x-patch)
From 54ab02eb2188382185436059ff6e7ad95d970c5d Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 17:00:31 +0700
Subject: [PATCH v28 07/10] Prevent inlining of interface functions for shmem
---
src/backend/access/common/tidstore.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index ad8c0866e2..d1b4675ea4 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -84,7 +84,7 @@
#define RT_PREFIX shared_rt
#define RT_SHMEM
-#define RT_SCOPE static
+#define RT_SCOPE static pg_noinline
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE uint64
--
2.39.1
Attachment: v28-0009-Speed-up-tidstore_iter_extract_tids.patch (text/x-patch)
From 8ccc66211973bcc44a6bad45c05302ca743c1489 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 17:53:37 +0700
Subject: [PATCH v28 09/10] Speed up tidstore_iter_extract_tids()
---
src/backend/access/common/tidstore.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index d1b4675ea4..5a897c01f7 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -632,21 +632,21 @@ tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
{
TidStoreIterResult *result = (&iter->result);
- for (int i = 0; i < sizeof(uint64) * BITS_PER_BYTE; i++)
+ while (val)
{
uint64 tid_i;
OffsetNumber off;
- if ((val & (UINT64CONST(1) << i)) == 0)
- continue;
-
tid_i = key << TIDSTORE_VALUE_NBITS;
- tid_i |= i;
+ tid_i |= pg_rightmost_one_pos64(val);
off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
Assert(result->num_offsets < iter->ts->control->max_offset);
result->offsets[result->num_offsets++] = off;
+
+ /* unset the rightmost bit */
+ val &= ~pg_rightmost_one64(val);
}
result->blkno = key_get_blkno(iter->ts, key);
--
2.39.1
Attachment: v28-0010-Revert-building-benchmark-module-for-CI.patch (text/x-patch)
From 42ba46f8073ee33bc5df6766f74f4c57587b070a Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 19:31:34 +0700
Subject: [PATCH v28 10/10] Revert building benchmark module for CI
---
contrib/meson.build | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/contrib/meson.build b/contrib/meson.build
index 421d469f8c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,7 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
-subdir('bench_radix_tree')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.39.1
On Tue, Feb 14, 2023 at 8:24 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Feb 13, 2023 at 2:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sat, Feb 11, 2023 at 2:33 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I didn't get any closer to radix-tree regression,
Me neither. It seems that in v26, inserting chunks into node-32 is
slow but needs more analysis. I'll share if I found something
interesting.
If that were the case, then the other benchmarks I ran would likely have slowed down as well, but they are the same or faster. There is one microbenchmark I didn't run before: "select * from bench_fixed_height_search(15)" (15 to reduce noise from growing size class, and despite the name it measures load time as well). Trying this now shows no difference: a few runs range 19 to 21ms in each version. That also reinforces that update_inner is fine and that the move to value pointer API didn't regress.
Changing TIDS_PER_BLOCK_FOR_LOAD to 1 to stress the tree more gives (min of 5, perf run separate from measurements):
v15 + v26 store:
mem_allocated | load_ms
---------------+---------
98202152 | 553

 19.71% postgres postgres [.] tidstore_add_tids
+ 31.47% postgres postgres [.] rt_set
= 51.18%

 20.62% postgres postgres [.] rt_node_insert_leaf
  6.05% postgres postgres [.] AllocSetAlloc
  4.74% postgres postgres [.] AllocSetFree
  4.62% postgres postgres [.] palloc
  2.23% postgres postgres [.] SlabAlloc

v26:

mem_allocated | load_ms
---------------+---------
98202032 | 617

 57.45% postgres postgres [.] tidstore_add_tids
 20.67% postgres postgres [.] local_rt_node_insert_leaf
  5.99% postgres postgres [.] AllocSetAlloc
  3.55% postgres postgres [.] palloc
  3.05% postgres postgres [.] AllocSetFree
  2.05% postgres postgres [.] SlabAlloc

So it seems the store itself got faster when we removed shared memory paths from the v26 store to test it against v15.
I thought to favor the local memory case in the tidstore by controlling inlining -- it's smaller and will be called much more often, so I tried the following (done in 0007)
 #define RT_PREFIX shared_rt
 #define RT_SHMEM
-#define RT_SCOPE static
+#define RT_SCOPE static pg_noinline

That brings it down to
mem_allocated | load_ms
---------------+---------
98202032 | 590
The improvement makes sense to me. I've also done the same test (with
changing TIDS_PER_BLOCK_FOR_LOAD to 1):
w/o 0007 patch:
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 334 | 445
(1 row)
w/ 0007 patch:
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 316 | 434
(1 row)
On the other hand, with TIDS_PER_BLOCK_FOR_LOAD being 30, the load
performance didn't improve:
w/o 0007 patch:
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 601 | 608
(1 row)
w/ 0007 patch:
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 610 | 606
(1 row)
That being said, it might be within noise level, so I agree with 0007 patch.
Perhaps some slowdown is unavoidable, but it would be nice to understand why.
True.
I can think that something like traversing a HOT chain could visit
offsets out of order. But fortunately we prune such collected TIDs
before heap vacuum in heap case.
Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common sense evaded me a few versions ago...
Right. TidStore is implemented not only for heap, so loading
out-of-order TIDs might be important in the future.
If these are acceptable, I can incorporate them into a later patchset.
These are nice improvements! I agree with all changes.
Great, I've squashed these into the tidstore patch (0004). Also added 0005, which is just a simplification.
I've attached some small patches to improve the radix tree and tidstrore:
We have the following WIP comment in test_radixtree:
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
and friends?
FYI I've briefly tested the TidStore with blocksize = 32kb, and it
seems to work fine.
I squashed the earlier dead code removal into the radix tree patch.
Thanks!
v27-0008 measures tid store iteration performance and adds a stub function to prevent spurious warnings, so the benchmarking module can always be built.
Getting the list of offsets from the old array for a given block is always trivial, but tidstore_iter_extract_tids() is doing a huge amount of unnecessary work when TIDS_PER_BLOCK_FOR_LOAD is 1, enough to exceed the load time:
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 589 | 915

Fortunately, it's an easy fix, done in 0009:
mem_allocated | load_ms | iter_ms
---------------+---------+---------
98202032 | 589 | 153
Cool!
I'll soon resume more cosmetic review of the tid store, but this is enough to post.
Thanks!
You removed the vacuum integration patch from v27, is there any reason for that?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v2899-0002-Small-improvements-for-radixtree-and-tests.patch.txt (text/plain)
From f06557689f33d9b11be1083362fcce19665b4014 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 16 Feb 2023 12:18:22 +0900
Subject: [PATCH v2899 2/2] Small improvements for radixtree and tests.
---
src/include/lib/radixtree.h | 2 +-
src/test/modules/test_radixtree/test_radixtree.c | 13 ++++++++++---
2 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 1cdb995e54..e546bd705c 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -1622,7 +1622,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
/* Descend the tree until we reach a leaf node */
while (shift >= 0)
{
- RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;;
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
child = RT_PTR_GET_LOCAL(tree, stored_child);
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index f944945db9..afe53382f3 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -107,13 +107,12 @@ static const test_spec test_specs[] = {
/* define the radix tree implementation to test */
#define RT_PREFIX rt
-#define RT_SCOPE static
+#define RT_SCOPE
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
#define RT_VALUE_TYPE TestValueType
-// WIP: compiles with warnings because rt_attach is defined but not used
-// #define RT_SHMEM
+/* #define RT_SHMEM */
#include "lib/radixtree.h"
@@ -142,6 +141,8 @@ test_empty(void)
#ifdef RT_SHMEM
int tranche_id = LWLockNewTrancheId();
dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
dsa = dsa_create(tranche_id);
radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
@@ -188,6 +189,8 @@ test_basic(int children, bool test_inner)
#ifdef RT_SHMEM
int tranche_id = LWLockNewTrancheId();
dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
dsa = dsa_create(tranche_id);
#endif
@@ -358,6 +361,8 @@ test_node_types(uint8 shift)
#ifdef RT_SHMEM
int tranche_id = LWLockNewTrancheId();
dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
dsa = dsa_create(tranche_id);
#endif
@@ -406,6 +411,8 @@ test_pattern(const test_spec * spec)
#ifdef RT_SHMEM
int tranche_id = LWLockNewTrancheId();
dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
dsa = dsa_create(tranche_id);
#endif
--
2.31.1
v2899-0001-comment-update-and-test-the-shared-tidstore.patch.txt (text/plain)
From f6ed6e18b2281cee96af98a39bdfc453117e6a21 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 16 Feb 2023 12:17:59 +0900
Subject: [PATCH v2899 1/2] comment update and test the shared tidstore.
---
src/backend/access/common/tidstore.c | 19 +++-------
.../modules/test_tidstore/test_tidstore.c | 37 +++++++++++++++++--
2 files changed, 40 insertions(+), 16 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 015e3dea81..8c05e60d92 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -64,13 +64,9 @@
* |---------------------------------------------| key
*
* The maximum height of the radix tree is 5 in this case.
- *
- * If the number of bits for offset number fits in a 64-bit value, we don't
- * encode tids but directly use the block number and the offset number as key
- * and value, respectively.
*/
#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
-#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
+#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
/* A magic value used to identify our TidStores. */
#define TIDSTORE_MAGIC 0x826f6a10
@@ -99,9 +95,10 @@ typedef struct TidStoreControl
/* These values are never changed after creation */
size_t max_bytes; /* the maximum bytes a TidStore can use */
int max_offset; /* the maximum offset number */
- int offset_nbits; /* the number of bits required for max_offset */
- int offset_key_nbits; /* the number of bits of a offset number
- * used for the key */
+ int offset_nbits; /* the number of bits required for an offset
+ * number */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used in a key */
/* The below fields are used only in shared case */
@@ -227,10 +224,6 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
- /*
- * We use tid encoding if the number of bits for the offset number doesn't
- * fix in a value, uint64.
- */
ts->control->offset_key_nbits =
ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
@@ -379,7 +372,7 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
{
uint64 off_bit;
- /* encode the tid to key and val */
+ /* encode the tid to a key and partial offset */
key = encode_key_off(ts, blkno, offsets[i], &off_bit);
/* make sure we scanned the line pointer array in order */
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
index 9b849ae8e8..9a1217f833 100644
--- a/src/test/modules/test_tidstore/test_tidstore.c
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -18,10 +18,13 @@
#include "miscadmin.h"
#include "storage/block.h"
#include "storage/itemptr.h"
+#include "storage/lwlock.h"
#include "utils/memutils.h"
PG_MODULE_MAGIC;
+/* #define TEST_SHARED_TIDSTORE 1 */
+
#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
PG_FUNCTION_INFO_V1(test_tidstore);
@@ -59,6 +62,18 @@ test_basic(int max_offset)
OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
int blk_idx;
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+#else
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+#endif
+
/* prepare the offset array */
offs[0] = FirstOffsetNumber;
offs[1] = FirstOffsetNumber + 1;
@@ -66,8 +81,6 @@ test_basic(int max_offset)
offs[3] = max_offset - 1;
offs[4] = max_offset;
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
-
/* add tids */
for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
@@ -144,6 +157,10 @@ test_basic(int max_offset)
}
tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
}
static void
@@ -153,9 +170,19 @@ test_empty(void)
TidStoreIter *iter;
ItemPointerData tid;
- elog(NOTICE, "testing empty tidstore");
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+#else
ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+#endif
+
+ elog(NOTICE, "testing empty tidstore");
ItemPointerSet(&tid, 0, FirstOffsetNumber);
if (tidstore_lookup_tid(ts, &tid))
@@ -180,6 +207,10 @@ test_empty(void)
tidstore_end_iterate(iter);
tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
}
Datum
--
2.31.1
On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Tue, Feb 14, 2023 at 8:24 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I can think that something like traversing a HOT chain could visit
offsets out of order. But fortunately we prune such collected TIDs
before heap vacuum in heap case.
Further, currently we *already* assume we populate the tid array in
order (for binary search), so we can just continue assuming that (with an
assert added since it's more public in this form). I'm not sure why such
basic common sense evaded me a few versions ago...
Right. TidStore is implemented not only for heap, so loading
out-of-order TIDs might be important in the future.
That's what I was probably thinking about some weeks ago, but I'm having a
hard time imagining how it would come up, even for something like the
conveyor-belt concept.
We have the following WIP comment in test_radixtree:
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
and friends?
Sounds good to me, and the other fixes make sense as well.
FYI I've briefly tested the TidStore with blocksize = 32kb, and it
seems to work fine.
That was on my list, so great! How about the other end -- nominally we
allow 512b. (In practice it won't matter, but this would make sure I didn't
mess anything up when forcing all MaxTuplesPerPage to encode.)
You removed the vacuum integration patch from v27, is there any reason
for that?
Just an oversight.
Now for some general comments on the tid store...
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
Do we need to do anything for this todo?
It might help readability to have a concept of "off_upper/off_lower", just
so we can describe things more clearly. The key is block + off_upper, and
the value is a bitmap of all the off_lower bits. I hinted at that in my
addition of encode_key_off(). Along those lines, maybe
s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even
sure the TIDSTORE_ prefix is valuable for these local macros.
The word "value" as a variable name is pretty generic in this context, and
it might be better to call it the off_lower_bitmap, at least in some
places. The "key" doesn't have a good short term for naming, but in
comments we should make sure we're clear it's "block# + off_upper".
I'm not a fan of the name "tid_i", even as a temp variable -- maybe
"compressed_tid"?
maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/
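To make that concrete, here is a minimal sketch of the encoding being discussed, using the suggested names -- the macros, the offset_nbits parameter, and the function body are illustrative placeholders, not what the patch currently defines:

/* illustrative sketch only -- names and layout are placeholders */
#define OFF_LOWER_NBITS 6   /* log2 of the 64 bits in one bitmap value */
#define OFF_LOWER_MASK  ((1 << OFF_LOWER_NBITS) - 1)

static inline uint64
encode_block_offset(BlockNumber block, OffsetNumber offset,
                    int offset_nbits,   /* bits needed to hold max_offset */
                    uint64 *off_lower_bitmap)
{
    /* pack block and offset into one integer, offset in the low bits */
    uint64  compressed_tid = ((uint64) block << offset_nbits) | offset;

    /* off_lower selects a bit within the 64-bit value... */
    *off_lower_bitmap = UINT64CONST(1) << (compressed_tid & OFF_LOWER_MASK);

    /* ...and block# + off_upper becomes the radix tree key */
    return compressed_tid >> OFF_LOWER_NBITS;
}

With that shape, "key" is unambiguous shorthand for block# + off_upper, and the value stored in the tree is purely the off_lower bitmap.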
It might be worth using typedefs for key and value type. Actually, since
key type is fixed for the foreseeable future, maybe the radix tree template
should define a key typedef?
The term "result" is probably fine within the tidstore, but as a public
name used by vacuum, it's not very descriptive. I don't have a good idea,
though.
Some files in backend/access use CamelCase for public functions, although
it's not consistent. I think doing that for tidstore would help
readability, since they would stand out from rt_* functions and vacuum
functions. It's a matter of taste, though.
I don't understand the control flow in tidstore_iterate_next(), or when
BlockNumberIsValid() is true. If this is the best way to code this, it
needs more commentary.
Some comments on vacuum:
I think we'd better get some real-world testing of this, fairly soon.
I had an idea: If it's not too much effort, it might be worth splitting it
into two parts: one that just adds the store (not caring about its memory
limits or progress reporting etc). During index scan, check both the new
store and the array and log a warning (we don't want to exit or crash,
better to try to investigate while live if possible) if the result doesn't
match. Then perhaps set up an instance and let something like TPC-C run for
a few days. The second patch would just restore the rest of the current
patch. That would help reassure us it's working as designed. Soon I plan to
do some measurements with vacuuming large tables to get some concrete
numbers that the community can get excited about.
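A rough sketch of what that cross-check could look like in the index-vacuum callback -- tidstore_lookup_tid() is from the posted patches and LVRelState is the existing vacuum state, but dead_items_array_lookup() and the field names are hypothetical placeholders:

static bool
lazy_tid_reaped_crosscheck(ItemPointer itemptr, void *state)
{
    LVRelState *vacrel = (LVRelState *) state;

    /* hypothetical: old-style bsearch over the existing dead-TID array */
    bool    in_array = dead_items_array_lookup(vacrel, itemptr);
    /* new lookup against the TidStore */
    bool    in_store = tidstore_lookup_tid(vacrel->dead_items, itemptr);

    if (in_array != in_store)
        elog(WARNING, "dead TID (%u,%u): array says %d, tidstore says %d",
             ItemPointerGetBlockNumber(itemptr),
             ItemPointerGetOffsetNumber(itemptr),
             in_array, in_store);

    /* keep the array's answer authoritative while investigating */
    return in_array;
}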
We also want to verify that progress reporting works as designed and has no
weird corner cases.
* autovacuum_work_mem) memory space to keep track of dead TIDs. We
initially
...
+ * create a TidStore with the maximum bytes that can be used by the TidStore.
This kind of implies that we allocate the maximum bytes upfront. I think
this sentence can be removed. We already mentioned in the previous
paragraph that we set an upper bound.
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in
%u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
I don't think the format string has to change, since num_tids was changed
back to int64 in an earlier patch version?
- * the memory space for storing dead items allocated in the DSM segment.
We
[a lot of whitespace adjustment]
+ * the shared TidStore. We launch parallel worker processes at the start of
The old comment still seems mostly ok? Maybe just s/DSM segment/DSA area/
or something else minor.
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
If we're starting from the minimum, "estimate" doesn't really describe it
anymore? Maybe "Initial size"?
What does dsa_minimum_size() work out to in practice? 1MB?
Also, I think PARALLEL_VACUUM_KEY_DSA is left over from an earlier patch.
Lastly, on the radix tree:
I find extend, set, and set_extend hard to keep straight when studying the
code. Maybe EXTEND -> EXTEND_UP , SET_EXTEND -> EXTEND_DOWN ?
RT_ITER_UPDATE_KEY is unused, but I somehow didn't notice when turning it
into a template.
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
These comments don't really help readers unfamiliar with the code. The
iteration coding in general needs clearer description.
In the test:
+ 4, /* RT_NODE_KIND_4 */
The small size was changed to 3 -- if this test needs to know the max size
for each kind (class?), I wonder why it didn't fail. Should it? Maybe we
need symbols for the various fanouts.
I also want to mention now that we better decide soon if we want to support
shrinking of nodes for v16, even if the tidstore never shrinks. We'll need
to do it at some point, but I'm not sure if doing it now would make more
work for future changes targeting highly concurrent workloads. If so, doing
it now would just be wasted work. On the other hand, someone might have a
use that needs deletion before someone else needs concurrency. Just in
case, I have a start of node-shrinking logic, but needs some work because
we need the (local pointer) parent to update to the new smaller node, just
like the growing case.
--
John Naylor
EDB: http://www.enterprisedb.com
Hi,
On 2023-02-16 16:22:56 +0700, John Naylor wrote:
On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com>
Right. TidStore is implemented not only for heap, so loading
out-of-order TIDs might be important in the future.

That's what I was probably thinking about some weeks ago, but I'm having a
hard time imagining how it would come up, even for something like the
conveyor-belt concept.
We really ought to replace the tid bitmap used for bitmap heap scans. The
hashtable we use is a pretty awful data structure for it. And that's not
filled in-order, for example.
Greetings,
Andres Freund
On Thu, Feb 16, 2023 at 11:44 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-02-16 16:22:56 +0700, John Naylor wrote:
On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com>
Right. TidStore is implemented not only for heap, so loading
out-of-order TIDs might be important in the future.

That's what I was probably thinking about some weeks ago, but I'm having a
hard time imagining how it would come up, even for something like the
conveyor-belt concept.

We really ought to replace the tid bitmap used for bitmap heap scans. The
hashtable we use is a pretty awful data structure for it. And that's not
filled in-order, for example.
I took a brief look at that and agree we should sometime make it work there
as well.
v26 tidstore_add_tids() appears to assume that it's only called once per
blocknumber. While the order of offsets doesn't matter there for a single
block, calling it again with the same block would wipe out the earlier
offsets, IIUC. To do an actual "add tid" where the order doesn't matter, it
seems we would need to (acquire lock if needed), read the current bitmap
and OR in the new bit if it exists, then write it back out.
That sounds slow, so it might still be good for vacuum to call a function
that passes a block and an array of offsets that are assumed ordered (as in
v28), but with a more accurate name, like tidstore_set_block_offsets().
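To illustrate what that would look like at the call site, here is a rough sketch (the wrapper function and the Assert checking are just for illustration, not actual patch code):

    static void
    dead_items_record_page(TidStore *dead_items, BlockNumber blkno,
                           OffsetNumber *offsets, int num_offsets)
    {
        Assert(num_offsets > 0 && num_offsets <= MaxHeapTuplesPerPage);

    #ifdef USE_ASSERT_CHECKING
        /* the caller passes each block exactly once, with offsets in order */
        for (int i = 1; i < num_offsets; i++)
            Assert(offsets[i] > offsets[i - 1]);
    #endif

        /* one call per heap block; replaces, rather than ORs into, the entry */
        tidstore_set_block_offsets(dead_items, blkno, offsets, num_offsets);
    }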
--
John Naylor
EDB: http://www.enterprisedb.com
On Thu, Feb 16, 2023 at 6:23 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 14, 2023 at 8:24 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

I can think that something like traversing a HOT chain could visit
offsets out of order. But fortunately we prune such collected TIDs
before heap vacuum in heap case.

Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common sense evaded me a few versions ago...

Right. TidStore is implemented not only for heap, so loading
out-of-order TIDs might be important in the future.

That's what I was probably thinking about some weeks ago, but I'm having a hard time imagining how it would come up, even for something like the conveyor-belt concept.
We have the following WIP comment in test_radixtree:
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM

How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
and friends?

Sounds good to me, and the other fixes make sense as well.
Thanks, I merged them.
FYI I've briefly tested the TidStore with blocksize = 32kb, and it
seems to work fine.

That was on my list, so great! How about the other end -- nominally we allow 512b. (In practice it won't matter, but this would make sure I didn't mess anything up when forcing all MaxTuplesPerPage to encode.)
According to the doc, the minimum block size is 1kB. It seems to work
fine with 1kB blocks.
You removed the vacuum integration patch from v27, is there any reason for that?
Just an oversight.
Now for some general comments on the tid store...
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)

Do we need to do anything for this todo?
Since it's practically no problem, I think we can live with it for
now. dshash also has the same todo.
It might help readability to have a concept of "off_upper/off_lower", just so we can describe things more clearly. The key is block + off_upper, and the value is a bitmap of all the off_lower bits. I hinted at that in my addition of encode_key_off(). Along those lines, maybe s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even sure the TIDSTORE_ prefix is valuable for these local macros.
The word "value" as a variable name is pretty generic in this context, and it might be better to call it the off_lower_bitmap, at least in some places. The "key" doesn't have a good short term for naming, but in comments we should make sure we're clear it's "block# + off_upper".
I'm not a fan of the name "tid_i", even as a temp variable -- maybe "compressed_tid"?
maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/
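To make those names concrete, here is a rough sketch of the encoding as described above. The bit widths and macro values below are made up for illustration, not taken from the patch:

    typedef uint64 tidkey;
    typedef uint64 offsetbm;        /* bitmap of off_lower bits */

    #define LOWER_OFFSET_NBITS  6   /* log2(bits in offsetbm); made-up value */
    #define LOWER_OFFSET_MASK   ((1 << LOWER_OFFSET_NBITS) - 1)
    #define OFFSET_NBITS        11  /* wide enough for MaxHeapTuplesPerPage; made-up */

    static inline tidkey
    encode_tid(BlockNumber block, OffsetNumber offset, int *off_lower)
    {
        uint64      compressed_tid;

        /* pack block number and offset into a single integer... */
        compressed_tid = ((uint64) block << OFFSET_NBITS) | offset;

        /* ...the low bits select a bit position in the value word... */
        *off_lower = (int) (compressed_tid & LOWER_OFFSET_MASK);

        /* ...and the rest (block# + off_upper) becomes the radix tree key */
        return (tidkey) (compressed_tid >> LOWER_OFFSET_NBITS);
    }

The value stored under that key would then be an offsetbm with bit (UINT64CONST(1) << off_lower) set.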
It might be worth using typedefs for key and value type. Actually, since key type is fixed for the foreseeable future, maybe the radix tree template should define a key typedef?
The term "result" is probably fine within the tidstore, but as a public name used by vacuum, it's not very descriptive. I don't have a good idea, though.
Some files in backend/access use CamelCase for public functions, although it's not consistent. I think doing that for tidstore would help readability, since they would stand out from rt_* functions and vacuum functions. It's a matter of taste, though.
I don't understand the control flow in tidstore_iterate_next(), or when BlockNumberIsValid() is true. If this is the best way to code this, it needs more commentary.
The attached 0008 patch addressed all above comments on tidstore.
Some comments on vacuum:
I think we'd better get some real-world testing of this, fairly soon.
I had an idea: If it's not too much effort, it might be worth splitting it into two parts: one that just adds the store (not caring about its memory limits or progress reporting etc). During index scan, check both the new store and the array and log a warning (we don't want to exit or crash, better to try to investigate while live if possible) if the result doesn't match. Then perhaps set up an instance and let something like TPC-C run for a few days. The second patch would just restore the rest of the current patch. That would help reassure us it's working as designed.
Yeah, I did a similar thing in an earlier version of tidstore patch.
Since we're trying to introduce two new components: radix tree and
tidstore, I sometimes find it hard to investigate failures happening
during lazy (parallel) vacuum due to a bug either in tidstore or radix
tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
stores tids to both the radix tree and array, and checks if the
results match when lookup or iteration. It will use more memory but it
would not be a big problem in USE_ASSERT_CHECKING builds. It would
also be great if we can enable such checks on some bf animals.
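For example, the extra checking could look something like the following sketch. DEBUG_TIDSTORE and the debug_tids/debug_num_tids shadow fields are hypothetical names, just to illustrate the idea:

    #ifdef DEBUG_TIDSTORE
    /* comparator with the signature bsearch() expects */
    static int
    debug_cmp_itemptr(const void *a, const void *b)
    {
        return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
    }

    /* cross-check a radix tree lookup against the shadow array */
    static void
    tidstore_debug_verify_lookup(TidStore *ts, ItemPointer tid, bool tree_found)
    {
        bool        array_found;

        array_found = bsearch(tid, ts->debug_tids, ts->debug_num_tids,
                              sizeof(ItemPointerData), debug_cmp_itemptr) != NULL;

        /* the radix tree and the flat array must always agree */
        Assert(tree_found == array_found);
    }
    #endif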
Soon I plan to do some measurements with vacuuming large tables to get some concrete numbers that the community can get excited about.
Thanks!
We also want to verify that progress reporting works as designed and has no weird corner cases.
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
...
+ * create a TidStore with the maximum bytes that can be used by the TidStore.

This kind of implies that we allocate the maximum bytes upfront. I think this sentence can be removed. We already mentioned in the previous paragraph that we set an upper bound.
Agreed.
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));

I don't think the format string has to change, since num_tids was changed back to int64 in an earlier patch version?
I think we need to change the format to INT64_FORMAT.
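That is, something like the following (assuming tidstore_num_tids() returns int64), which would also restore the missing space before "dead":

    ereport(DEBUG2,
            (errmsg("table \"%s\": removed " INT64_FORMAT " dead item identifiers in %u pages",
                    vacrel->relname, tidstore_num_tids(vacrel->dead_items),
                    vacuumed_pages)));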
- * the memory space for storing dead items allocated in the DSM segment. We
[a lot of whitespace adjustment]
+ * the shared TidStore. We launch parallel worker processes at the start of

The old comment still seems mostly ok? Maybe just s/DSM segment/DSA area/ or something else minor.
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);

If we're starting from the minimum, "estimate" doesn't really describe it anymore? Maybe "Initial size"?
What does dsa_minimum_size() work out to in practice? 1MB?
Also, I think PARALLEL_VACUUM_KEY_DSA is left over from an earlier patch.
Right. The attached 0009 patch addressed comments on vacuum
integration except for the correctness checking.
Lastly, on the radix tree:
I find extend, set, and set_extend hard to keep straight when studying the code. Maybe EXTEND -> EXTEND_UP , SET_EXTEND -> EXTEND_DOWN ?
RT_ITER_UPDATE_KEY is unused, but I somehow didn't notice when turning it into a template.
It was used in radixtree_iter_impl.h. But I removed it as it was not necessary.
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);

+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)

These comments don't really help readers unfamiliar with the code. The iteration coding in general needs clearer description.
I agree with all of the above comments. The attached 0007 patch
addressed comments on the radix tree.
In the test:
+ 4, /* RT_NODE_KIND_4 */
The small size was changed to 3 -- if this test needs to know the max size for each kind (class?), I wonder why it didn't fail. Should it? Maybe we need symbols for the various fanouts.
Since this information is only used to determine the number of keys
inserted, it doesn't check the node kind. So we just didn't test node-3. It might
be better to expose and use both RT_SIZE_CLASS and RT_SIZE_CLASS_INFO.
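For instance, the test could then loop over the size classes rather than hard-coding counts. This is only a sketch; RT_SIZE_CLASS_COUNT, the fanout field, and test_node_growth() are assumed names:

    for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
    {
        int     fanout = RT_SIZE_CLASS_INFO[i].fanout;

        /* insert exactly 'fanout' keys so every size class gets exercised */
        test_node_growth(fanout);
    }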
I also want to mention now that we better decide soon if we want to support shrinking of nodes for v16, even if the tidstore never shrinks. We'll need to do it at some point, but I'm not sure if doing it now would make more work for future changes targeting highly concurrent workloads. If so, doing it now would just be wasted work. On the other hand, someone might have a use that needs deletion before someone else needs concurrency. Just in case, I have a start of node-shrinking logic, but needs some work because we need the (local pointer) parent to update to the new smaller node, just like the growing case.
Thanks, that's also on my todo list. TBH I'm not sure we should
improve the deletion at this stage as there is no use case of deletion
in the core. I'd prefer to focus on improving the quality of the
current radix tree and tidstore now, and I think we can support
node-shrinking once we are confident with the current implementation.
On Fri, Feb 17, 2023 at 5:00 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
That sounds slow, so it might still be good for vacuum to call a function that passes a block and an array of offsets that are assumed ordered (as in v28), but with a more accurate name, like tidstore_set_block_offsets().
tidstore_set_block_offsets() sounds better. I used
TidStoreSetBlockOffsets() in the latest patch set.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v29-0006-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch
From b9883174cb69d87e6c9fdccb33ae29d5f084cd8e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 7 Feb 2023 17:19:29 +0700
Subject: [PATCH v29 06/10] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space efficient and slow to lookup. Also, we had
the 1GB limit on its size.
Now we use TIDStore to store dead tuple TIDs. Since the TIDStore,
backed by the radix tree, incrementally allocates memory, we get
rid of the 1GB limit.
Since we are no longer able to exactly estimate the maximum number of
TIDs that can be stored, pg_stat_progress_vacuum shows the progress
information based on the amount of memory in bytes. The column names
are also changed to max_dead_tuple_bytes and num_dead_tuple_bytes.
In addition, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by TIDStore is 1MB, the initial DSA
segment size. Due to that, we increase the minimum value of
maintenance_work_mem (also autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 278 ++++++++-------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-----
src/backend/commands/vacuumparallel.c | 73 +++---
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 177 insertions(+), 314 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e28206e056..1d84e17705 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7165,10 +7165,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -7176,10 +7176,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..b4e40423a8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3,18 +3,18 @@
* vacuumlazy.c
* Concurrent ("lazy") vacuuming.
*
- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TidStore, a storage for dead TIDs
* that are to be removed from indexes. We want to ensure we can vacuum even
* the very largest relations with finite memory space usage. To do that, we
- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables). If the array threatens to overflow, we must call
- * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
- * This frees up the memory space dedicated to storing dead TIDs.
+ * create a TidStore with the maximum bytes that can be used by the TidStore.
+ * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
+ * vacuum the pages that we've pruned). This frees up the memory space dedicated
+ * to storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,11 +221,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -259,8 +263,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -487,11 +492,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/*
- * Allocate dead_items array memory using dead_items_alloc. This handles
- * parallel VACUUM initialization as part of allocating shared memory
- * space used for dead_items. (But do a failsafe precheck first, to
- * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
- * is already dangerously old.)
+ * Allocate dead_items memory using dead_items_alloc. This handles parallel
+ * VACUUM initialization as part of allocating shared memory space used for
+ * dead_items. (But do a failsafe precheck first, to ensure that parallel
+ * VACUUM won't be attempted at all when relfrozenxid is already dangerously
+ * old.)
*/
lazy_check_wraparound_failsafe(vacrel);
dead_items_alloc(vacrel, params->nworkers);
@@ -797,7 +802,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* have collected the TIDs whose index tuples need to be removed.
*
* Finally, invokes lazy_vacuum_heap_rel to vacuum heap pages, which
- * largely consists of marking LP_DEAD items (from collected TID array)
+ * largely consists of marking LP_DEAD items (from vacrel->dead_items)
* as LP_UNUSED. This has to happen in a second, final pass over the
* heap, to preserve a basic invariant that all index AMs rely on: no
* extant index tuple can ever be allowed to contain a TID that points to
@@ -825,21 +830,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +911,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -969,7 +973,7 @@ lazy_scan_heap(LVRelState *vacrel)
continue;
}
- /* Collect LP_DEAD items in dead_items array, count tuples */
+ /* Collect LP_DEAD items in dead_items, count tuples */
if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
&recordfreespace))
{
@@ -1011,14 +1015,14 @@ lazy_scan_heap(LVRelState *vacrel)
* Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
- * dead_items array. This includes LP_DEAD line pointers that we
- * pruned ourselves, as well as existing LP_DEAD line pointers that
- * were pruned some time earlier. Also considers freezing XIDs in the
- * tuple headers of remaining items with storage.
+ * dead_items. This includes LP_DEAD line pointers that we pruned
+ * ourselves, as well as existing LP_DEAD line pointers that were pruned
+ * some time earlier. Also considers freezing XIDs in the tuple headers
+ * of remaining items with storage.
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1038,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1080,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page in dead_items */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1156,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1204,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1260,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1524,9 +1535,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
* The approach we take now is to restart pruning when the race condition is
* detected. This allows heap_page_prune() to prune the tuples inserted by
* the now-aborted transaction. This is a little crude, but it guarantees
- * that any items that make it into the dead_items array are simple LP_DEAD
- * line pointers, and that every remaining item with tuple storage is
- * considered as a candidate for freezing.
+ * that any items that make it into the dead_items are simple LP_DEAD line
+ * pointers, and that every remaining item with tuple storage is considered
+ * as a candidate for freezing.
*/
static void
lazy_scan_prune(LVRelState *vacrel,
@@ -1543,13 +1554,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1580,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1588,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->deadoffsets; prunestate->deadoffsets's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1601,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1646,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1875,7 +1883,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1896,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1917,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -1940,7 +1929,7 @@ retry:
* lazy_scan_prune, which requires a full cleanup lock. While pruning isn't
* performed here, it's quite possible that an earlier opportunistic pruning
* operation left LP_DEAD items behind. We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items for removal from indexes.
*
* For aggressive VACUUM callers, we may return false to indicate that a full
* cleanup lock is required for processing by lazy_scan_prune. This is only
@@ -2099,7 +2088,7 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
vacrel->NewRelminMxid = NoFreezePageRelminMxid;
- /* Save any LP_DEAD items found on the page in dead_items array */
+ /* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
@@ -2129,8 +2118,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2127,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2179,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2208,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2235,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2281,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2354,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2392,9 +2373,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
/*
* lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
*
- * This routine marks LP_DEAD items in vacrel->dead_items array as LP_UNUSED.
- * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
- * at all.
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
*
* We may also be able to truncate the line pointer array of the heap pages we
* visit. If there is a contiguous group of LP_UNUSED items at the end of the
@@ -2410,10 +2390,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2409,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2419,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2433,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2444,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,36 +2454,31 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2497,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2571,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -2687,8 +2660,8 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
* lazy_vacuum_one_index() -- vacuum index relation.
*
* Delete all the index tuples containing a TID collected in
- * vacrel->dead_items array. Also update running statistics.
- * Exact details depend on index AM's ambulkdelete routine.
+ * vacrel->dead_items. Also update running statistics. Exact
+ * details depend on index AM's ambulkdelete routine.
*
* reltuples is the number of heap tuples to be passed to the
* bulkdelete callback. It's always assumed to be estimated.
@@ -3094,48 +3067,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
}
/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
-/*
- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
@@ -3143,11 +3076,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3105,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3118,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..149d41b41c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index aa79d9de4d..d8e680ca20 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2343,82 +2342,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch(itemptr,
- dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..d653683693 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,12 +9,11 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSM segment. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * the shared TidStore. We launch parallel worker processes at the start of
+ * parallel index bulk-deletion and index cleanup and once all indexes are
+ * processed, the parallel worker processes exit. Each time we process indexes
+ * in parallel, the parallel context is re-initialized so that the same DSM can
+ * be used for multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -103,6 +102,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +168,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +225,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +289,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +356,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +375,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +384,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +441,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +452,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +950,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +996,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1045,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index ff6149a179..a371f6fbba 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3397,12 +3397,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index b46e3b8c55..27a88b9369 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2312,7 +2312,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..a3ebb169ef 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index 6cd57e3eaa..d1889b9d10 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 174b725fff..8fa4e86be8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2032,8 +2032,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index a3738833b2..edb5e4b4f3 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
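For context on the byte-based progress columns above: the vacuum.h hunk removes VacDeadItems together with its MAXDEADITEMS() macro, so the fixed "one ItemPointerData per dead tuple" capacity calculation goes away, and pg_stat_progress_vacuum switches to max_dead_tuple_bytes/dead_tuple_bytes, presumably because TidStore memory usage no longer maps to a fixed item count. Below is a small standalone sketch (plain C, not PostgreSQL code; the struct only mirrors the layout of the removed VacDeadItems, and ITEM_POINTER_SIZE stands in for sizeof(ItemPointerData)) of the arithmetic the old macro performed for the default 64MB maintenance_work_mem:

#include <stdio.h>
#include <stddef.h>

/* Mirror of the removed VacDeadItems layout, for illustration only */
typedef struct VacDeadItemsSketch
{
    int         max_items;      /* # slots allocated in array */
    int         num_items;      /* current # of entries */
    char        items[];        /* stand-in for ItemPointerData[] */
} VacDeadItemsSketch;

#define ITEM_POINTER_SIZE 6     /* sizeof(ItemPointerData) in PostgreSQL */
#define MAXDEADITEMS_SKETCH(avail_mem) \
    (((avail_mem) - offsetof(VacDeadItemsSketch, items)) / ITEM_POINTER_SIZE)

int
main(void)
{
    size_t      avail_mem = 64UL * 1024 * 1024; /* default maintenance_work_mem */

    /* Prints roughly 11.18 million: the old hard item limit for 64MB */
    printf("max dead TIDs: %zu\n", (size_t) MAXDEADITEMS_SKETCH(avail_mem));
    return 0;
}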
Attachment: v29-0007-Review-radix-tree.patch
From 52e0d50d6e882c0444ccdf15f8afcc1aef3a6987 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 20 Feb 2023 11:28:50 +0900
Subject: [PATCH v29 07/10] Review radix tree.
Mainly improve the iteration code and comments.
---
src/include/lib/radixtree.h | 169 +++++++++---------
src/include/lib/radixtree_iter_impl.h | 85 ++++-----
.../expected/test_radixtree.out | 6 +-
.../modules/test_radixtree/test_radixtree.c | 103 +++++++----
4 files changed, 197 insertions(+), 166 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index e546bd705c..8bea606c62 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -83,7 +83,7 @@
* RT_SET - Set a key-value pair
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
* RT_ITERATE_NEXT - Return next key-value pair, if any
- * RT_END_ITER - End iteration
+ * RT_END_ITERATE - End iteration
* RT_MEMORY_USAGE - Get the memory usage
*
* Interface for Shared Memory
@@ -152,8 +152,8 @@
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
-#define RT_EXTEND RT_MAKE_NAME(extend)
-#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_EXTEND_UP RT_MAKE_NAME(extend_up)
+#define RT_EXTEND_DOWN RT_MAKE_NAME(extend_down)
#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
@@ -191,7 +191,7 @@
#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
-#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_SET_NODE_FROM RT_MAKE_NAME(iter_set_node_from)
#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
@@ -612,7 +612,6 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
#endif
/* Contains the actual tree and ancillary info */
-// WIP: this name is a bit strange
typedef struct RT_RADIX_TREE_CONTROL
{
#ifdef RT_SHMEM
@@ -651,36 +650,40 @@ typedef struct RT_RADIX_TREE
* Iteration support.
*
* Iterating the radix tree returns each pair of key and value in the ascending
- * order of the key. To support this, the we iterate nodes of each level.
+ * order of the key.
*
- * RT_NODE_ITER struct is used to track the iteration within a node.
+ * RT_NODE_ITER is the struct for iteration of one radix tree node.
*
* RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
- * in order to track the iteration of each level. During iteration, we also
- * construct the key whenever updating the node iteration information, e.g., when
- * advancing the current index within the node or when moving to the next node
- * at the same level.
- *
- * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
- * has the local pointers to nodes, rather than RT_PTR_ALLOC.
- * We need either a safeguard to disallow other processes to begin the iteration
- * while one process is doing or to allow multiple processes to do the iteration.
+ * for each level to track the iteration within the node.
*/
typedef struct RT_NODE_ITER
{
- RT_PTR_LOCAL node; /* current node being iterated */
- int current_idx; /* current position. -1 for initial value */
+ /*
+ * Local pointer to the node we are iterating over.
+ *
+ * Since the radix tree doesn't support the shared iteration among multiple
+ * processes, we use RT_PTR_LOCAL rather than RT_PTR_ALLOC.
+ */
+ RT_PTR_LOCAL node;
+
+ /*
+ * The next index of the chunk array in RT_NODE_KIND_3 and
+ * RT_NODE_KIND_32 nodes, or the next chunk in RT_NODE_KIND_125 and
+ * RT_NODE_KIND_256 nodes. 0 for the initial value.
+ */
+ int idx;
} RT_NODE_ITER;
typedef struct RT_ITER
{
RT_RADIX_TREE *tree;
- /* Track the iteration on nodes of each level */
- RT_NODE_ITER stack[RT_MAX_LEVEL];
- int stack_len;
+ /* Track the nodes for each level. level = 0 is for a leaf node */
+ RT_NODE_ITER node_iters[RT_MAX_LEVEL];
+ int top_level;
- /* The key is constructed during iteration */
+ /* The key constructed during the iteration */
uint64 key;
} RT_ITER;
@@ -1243,7 +1246,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
* it can store the key.
*/
static pg_noinline void
-RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+RT_EXTEND_UP(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
@@ -1282,7 +1285,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static pg_noinline void
-RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+RT_EXTEND_DOWN(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
int shift = node->shift;
@@ -1613,7 +1616,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
/* Extend the tree if necessary */
if (key > tree->ctl->max_val)
- RT_EXTEND(tree, key);
+ RT_EXTEND_UP(tree, key);
stored_child = tree->ctl->root;
parent = RT_PTR_GET_LOCAL(tree, stored_child);
@@ -1631,7 +1634,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
{
- RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_EXTEND_DOWN(tree, key, value_p, parent, stored_child, child);
RT_UNLOCK(tree);
return false;
}
@@ -1805,16 +1808,9 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
}
#endif
-static inline void
-RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
-{
- iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
- iter->key |= (((uint64) chunk) << shift);
-}
-
/*
- * Advance the slot in the inner node. Return the child if exists, otherwise
- * null.
+ * Scan the inner node and return the next child node if one exists, otherwise
+ * return NULL.
*/
static inline RT_PTR_LOCAL
RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
@@ -1825,8 +1821,8 @@ RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
}
/*
- * Advance the slot in the leaf node. On success, return true and the value
- * is set to value_p, otherwise return false.
+ * Scan the leaf node; if a next value exists, set it to value_p and return
+ * true. Otherwise return false.
*/
static inline bool
RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
@@ -1838,29 +1834,50 @@ RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
}
/*
- * Update each node_iter for inner nodes in the iterator node stack.
+ * While descending the radix tree from the 'from' node to the bottom, we
+ * set the next node to iterate for each level.
*/
static void
-RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+RT_ITER_SET_NODE_FROM(RT_ITER *iter, RT_PTR_LOCAL from)
{
- int level = from;
- RT_PTR_LOCAL node = from_node;
+ int level = from->shift / RT_NODE_SPAN;
+ RT_PTR_LOCAL node = from;
for (;;)
{
- RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+ RT_NODE_ITER *node_iter = &(iter->node_iters[level--]);
+
+#ifdef USE_ASSERT_CHECKING
+ if (node_iter->node)
+ {
+ /* We must have finished the iteration on the previous node */
+ if (RT_NODE_IS_LEAF(node_iter->node))
+ {
+ uint64 dummy;
+ Assert(!RT_NODE_LEAF_ITERATE_NEXT(iter, node_iter, &dummy));
+ }
+ else
+ Assert(!RT_NODE_INNER_ITERATE_NEXT(iter, node_iter));
+ }
+#endif
+ /* Set the node to the node iterator of this level */
node_iter->node = node;
- node_iter->current_idx = -1;
+ node_iter->idx = 0;
- /* We don't advance the leaf node iterator here */
if (RT_NODE_IS_LEAF(node))
- return;
+ {
+ /* We will visit the leaf node when RT_ITERATE_NEXT() is called */
+ break;
+ }
- /* Advance to the next slot in the inner node */
+ /*
+ * Get the first child node from the node, which corresponds to the
+ * lowest chunk within the node.
+ */
node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
- /* We must find the first children in the node */
+ /* The first child must be found */
Assert(node);
}
}
@@ -1874,14 +1891,11 @@ RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
RT_SCOPE RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
{
- MemoryContext old_ctx;
RT_ITER *iter;
RT_PTR_LOCAL root;
- int top_level;
- old_ctx = MemoryContextSwitchTo(tree->context);
-
- iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->context,
+ sizeof(RT_ITER));
iter->tree = tree;
RT_LOCK_SHARED(tree);
@@ -1891,16 +1905,13 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
return iter;
root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
- top_level = root->shift / RT_NODE_SPAN;
- iter->stack_len = top_level;
+ iter->top_level = root->shift / RT_NODE_SPAN;
/*
- * Descend to the left most leaf node from the root. The key is being
- * constructed while descending to the leaf.
+ * Set the next node to iterate for each level from the level of the
+ * root node.
*/
- RT_UPDATE_ITER_STACK(iter, root, top_level);
-
- MemoryContextSwitchTo(old_ctx);
+ RT_ITER_SET_NODE_FROM(iter, root);
return iter;
}
@@ -1912,6 +1923,8 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
RT_SCOPE bool
RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
+ Assert(value_p != NULL);
+
/* Empty tree */
if (!iter->tree->ctl->root)
return false;
@@ -1919,43 +1932,38 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
for (;;)
{
RT_PTR_LOCAL child = NULL;
- RT_VALUE_TYPE value;
- int level;
- bool found;
-
- /* Advance the leaf node iterator to get next key-value pair */
- found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
- if (found)
+ /* Get the next chunk of the leaf node */
+ if (RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->node_iters[0]), value_p))
{
*key_p = iter->key;
- *value_p = value;
return true;
}
/*
- * We've visited all values in the leaf node, so advance inner node
- * iterators from the level=1 until we find the next child node.
+ * We've visited all values in the leaf node, so advance all inner node
+ * iterators by visiting inner nodes from the level = 1 until we find the
+ * next inner node that has a child node.
*/
- for (level = 1; level <= iter->stack_len; level++)
+ for (int level = 1; level <= iter->top_level; level++)
{
- child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->node_iters[level]));
if (child)
break;
}
- /* the iteration finished */
+ /* We've visited all nodes, so the iteration finished */
if (!child)
- return false;
+ break;
/*
- * Set the node to the node iterator and update the iterator stack
- * from this node.
+ * Found the new child node. We update the next node to iterate for each
+ * level from the level of this child node.
*/
- RT_UPDATE_ITER_STACK(iter, child, level - 1);
+ RT_ITER_SET_NODE_FROM(iter, child);
- /* Node iterators are updated, so try again from the leaf */
+ /* Find key-value from the leaf node again */
}
return false;
@@ -2470,8 +2478,8 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_INIT_NODE
#undef RT_FREE_NODE
#undef RT_FREE_RECURSE
-#undef RT_EXTEND
-#undef RT_SET_EXTEND
+#undef RT_EXTEND_UP
+#undef RT_EXTEND_DOWN
#undef RT_SWITCH_NODE_KIND
#undef RT_COPY_NODE
#undef RT_REPLACE_NODE
@@ -2509,8 +2517,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_NODE_INSERT_LEAF
#undef RT_NODE_INNER_ITERATE_NEXT
#undef RT_NODE_LEAF_ITERATE_NEXT
-#undef RT_UPDATE_ITER_STACK
-#undef RT_ITER_UPDATE_KEY
+#undef RT_ITER_SET_NODE_FROM
#undef RT_VERIFY_NODE
#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 98c78eb237..5c1034768e 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -27,12 +27,10 @@
#error node level must be either inner or leaf
#endif
- bool found = false;
- uint8 key_chunk;
+ uint8 key_chunk = 0;
#ifdef RT_NODE_LEVEL_LEAF
- RT_VALUE_TYPE value;
-
+ Assert(value_p != NULL);
Assert(RT_NODE_IS_LEAF(node_iter->node));
#else
RT_PTR_LOCAL child = NULL;
@@ -50,99 +48,92 @@
{
RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
- node_iter->current_idx++;
- if (node_iter->current_idx >= n3->base.n.count)
- break;
+ if (node_iter->idx >= n3->base.n.count)
+ return false;
+
#ifdef RT_NODE_LEVEL_LEAF
- value = n3->values[node_iter->current_idx];
+ *value_p = n3->values[node_iter->idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->idx]);
#endif
- key_chunk = n3->base.chunks[node_iter->current_idx];
- found = true;
+ key_chunk = n3->base.chunks[node_iter->idx];
+ node_iter->idx++;
break;
}
case RT_NODE_KIND_32:
{
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
- node_iter->current_idx++;
- if (node_iter->current_idx >= n32->base.n.count)
- break;
+ if (node_iter->idx >= n32->base.n.count)
+ return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = n32->values[node_iter->current_idx];
+ *value_p = n32->values[node_iter->idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->idx]);
#endif
- key_chunk = n32->base.chunks[node_iter->current_idx];
- found = true;
+ key_chunk = n32->base.chunks[node_iter->idx];
+ node_iter->idx++;
break;
}
case RT_NODE_KIND_125:
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
- int i;
+ int chunk;
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
{
- if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
break;
}
- if (i >= RT_NODE_MAX_SLOTS)
- break;
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
- node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
#else
- child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, chunk));
#endif
- key_chunk = i;
- found = true;
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
break;
}
case RT_NODE_KIND_256:
{
RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
- int i;
+ int chunk;
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
{
#ifdef RT_NODE_LEVEL_LEAF
- if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
#else
- if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
#endif
break;
}
- if (i >= RT_NODE_MAX_SLOTS)
- break;
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
- node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
#else
- child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, chunk));
#endif
- key_chunk = i;
- found = true;
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
break;
}
}
- if (found)
- {
- RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
-#ifdef RT_NODE_LEVEL_LEAF
- *value_p = value;
-#endif
- }
+ /* Update the part of the key */
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << node_iter->node->shift);
+ iter->key |= (((uint64) key_chunk) << node_iter->node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- return found;
+ return true;
#else
return child;
#endif
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index ce645cb8b5..7ad1ce3605 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -4,8 +4,10 @@ CREATE EXTENSION test_radixtree;
-- an error if something fails.
--
SELECT test_radixtree();
-NOTICE: testing basic operations with leaf node 4
-NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 3
+NOTICE: testing basic operations with inner node 3
+NOTICE: testing basic operations with leaf node 15
+NOTICE: testing basic operations with inner node 15
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 125
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index afe53382f3..5a169854d9 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -43,12 +43,15 @@ typedef uint64 TestValueType;
*/
static const bool rt_test_stats = false;
-static int rt_node_kind_fanouts[] = {
- 0,
- 4, /* RT_NODE_KIND_4 */
- 32, /* RT_NODE_KIND_32 */
- 125, /* RT_NODE_KIND_125 */
- 256 /* RT_NODE_KIND_256 */
+/*
+ * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
+ */
+static int rt_node_class_fanouts[] = {
+ 3, /* RT_CLASS_3 */
+ 15, /* RT_CLASS_32_MIN */
+ 32, /* RT_CLASS_32_MAX */
+ 125, /* RT_CLASS_125 */
+ 256 /* RT_CLASS_256 */
};
/*
* A struct to define a pattern of integers, for use with the test_pattern()
@@ -260,10 +263,9 @@ test_basic(int children, bool test_inner)
* Check if keys from start to end with the shift exist in the tree.
*/
static void
-check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
- int incr)
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end)
{
- for (int i = start; i < end; i++)
+ for (int i = start; i <= end; i++)
{
uint64 key = ((uint64) i << shift);
TestValueType val;
@@ -277,22 +279,26 @@ check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
}
}
+/*
+ * Insert 256 key-value pairs, and check if keys are properly inserted on each
+ * node class.
+ */
+/* Test keys [0, 256) */
+#define NODE_TYPE_TEST_KEY_MIN 0
+#define NODE_TYPE_TEST_KEY_MAX 256
static void
-test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+test_node_types_insert_asc(rt_radix_tree *radixtree, uint8 shift)
{
- uint64 num_entries;
- int ninserted = 0;
- int start = insert_asc ? 0 : 256;
- int incr = insert_asc ? 1 : -1;
- int end = insert_asc ? 256 : 0;
- int node_kind_idx = 1;
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = 0;
- for (int i = start; i != end; i += incr)
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
{
uint64 key = ((uint64) i << shift);
bool found;
- found = rt_set(radixtree, key, (TestValueType*) &key);
+ found = rt_set(radixtree, key, (TestValueType *) &key);
if (found)
elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
@@ -300,24 +306,49 @@ test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
* After filling all slots in each node type, check if the values
* are stored properly.
*/
- if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
{
- int check_start = insert_asc
- ? rt_node_kind_fanouts[node_kind_idx - 1]
- : rt_node_kind_fanouts[node_kind_idx];
- int check_end = insert_asc
- ? rt_node_kind_fanouts[node_kind_idx]
- : rt_node_kind_fanouts[node_kind_idx - 1];
-
- check_search_on_node(radixtree, shift, check_start, check_end, incr);
- node_kind_idx++;
+ check_search_on_node(radixtree, shift, key_checked, i);
+ key_checked = i;
+ node_class_idx++;
}
-
- ninserted++;
}
num_entries = rt_num_entries(radixtree);
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Similar to test_node_types_insert_asc(), but inserts keys in descending order.
+ */
+static void
+test_node_types_insert_desc(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = NODE_TYPE_TEST_KEY_MAX - 1;
+
+ for (int i = NODE_TYPE_TEST_KEY_MAX - 1; i >= NODE_TYPE_TEST_KEY_MIN; i--)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType *) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
+ {
+ check_search_on_node(radixtree, shift, i, key_checked);
+ key_checked = i;
+ node_class_idx++;
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
if (num_entries != 256)
elog(ERROR,
"rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
@@ -329,7 +360,7 @@ test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
{
uint64 num_entries;
- for (int i = 0; i < 256; i++)
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
{
uint64 key = ((uint64) i << shift);
bool found;
@@ -379,9 +410,9 @@ test_node_types(uint8 shift)
* then delete all entries to make it empty, and insert and search entries
* again.
*/
- test_node_types_insert(radixtree, shift, true);
+ test_node_types_insert_asc(radixtree, shift);
test_node_types_delete(radixtree, shift);
- test_node_types_insert(radixtree, shift, false);
+ test_node_types_insert_desc(radixtree, shift);
rt_free(radixtree);
#ifdef RT_SHMEM
@@ -664,10 +695,10 @@ test_radixtree(PG_FUNCTION_ARGS)
{
test_empty();
- for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ for (int i = 0; i < lengthof(rt_node_class_fanouts); i++)
{
- test_basic(rt_node_kind_fanouts[i], false);
- test_basic(rt_node_kind_fanouts[i], true);
+ test_basic(rt_node_class_fanouts[i], false);
+ test_basic(rt_node_class_fanouts[i], true);
}
for (int shift = 0; shift <= (64 - 8); shift += 8)
--
2.31.1
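To make the renamed iteration entry points above concrete, here is a minimal usage sketch of the templated radix tree (not from the patch set). The template parameters follow the pattern tidstore.c uses; the generated names (demo_rt_*) follow the RT_PREFIX convention, but the create function's exact name and arguments are an assumption and may differ in v29:

#include "postgres.h"

#define RT_PREFIX demo_rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE uint64
#include "lib/radixtree.h"

static void
demo_radix_tree_usage(void)
{
    demo_rt_radix_tree *tree;
    demo_rt_iter *iter;
    uint64      key;
    uint64      value;

    /* hypothetical create call; the real signature may take more arguments */
    tree = demo_rt_create(CurrentMemoryContext);

    for (key = 0; key < 1000; key++)
    {
        value = key * 10;
        /* returns true if the key was already present */
        (void) demo_rt_set(tree, key, &value);
    }

    /* RT_ITERATE_NEXT returns pairs in ascending key order */
    iter = demo_rt_begin_iterate(tree);
    while (demo_rt_iterate_next(iter, &key, &value))
        elog(DEBUG1, "key " UINT64_FORMAT " value " UINT64_FORMAT, key, value);
    demo_rt_end_iterate(iter);

    demo_rt_free(tree);
}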
Attachment: v29-0010-Revert-building-benchmark-module-for-CI.patch
From b6a692913ce8c6868996336f4be778eb5f83d02c Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 19:31:34 +0700
Subject: [PATCH v29 10/10] Revert building benchmark module for CI
---
contrib/meson.build | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/contrib/meson.build b/contrib/meson.build
index 421d469f8c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,7 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
-subdir('bench_radix_tree')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
Attachment: v29-0009-Review-vacuum-integration.patch
From e804119fddce3bc0520bedc70c966470c7db35e9 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 17 Feb 2023 00:04:37 +0900
Subject: [PATCH v29 09/10] Review vacuum integration.
---
src/backend/access/heap/vacuumlazy.c | 61 +++++++++++++--------------
src/backend/commands/vacuum.c | 4 +-
src/backend/commands/vacuumparallel.c | 25 +++++------
3 files changed, 45 insertions(+), 45 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b4e40423a8..edb9079124 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -10,11 +10,10 @@
* of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
- * autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * create a TidStore with the maximum bytes that can be used by the TidStore.
- * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
- * vacuum the pages that we've pruned). This frees up the memory space dedicated
- * to storing dead TIDs.
+ * autovacuum_work_mem) memory space to keep track of dead TIDs. If the
+ * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum
+ * the pages that we've pruned). This frees up the memory space dedicated to
+ * storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -844,7 +843,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
+ initprog_val[2] = TidStoreMaxMemory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -911,7 +910,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- if (tidstore_is_full(vacrel->dead_items))
+ if (TidStoreIsFull(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1080,16 +1079,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(tidstore_num_tids(dead_items) == 0);
+ Assert(TidStoreNumTids(dead_items) == 0);
}
else if (prunestate.num_offsets > 0)
{
/* Save details of the LP_DEAD items from the page in dead_items */
- tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
- prunestate.num_offsets);
+ TidStoreSetBlockOffsets(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
- tidstore_memory_usage(dead_items));
+ TidStoreMemoryUsage(dead_items));
}
/*
@@ -1260,7 +1259,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (tidstore_num_tids(dead_items) > 0)
+ if (TidStoreNumTids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -2127,10 +2126,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
+ TidStoreSetBlockOffsets(dead_items, blkno, deadoffsets, lpdead_items);
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
- tidstore_memory_usage(dead_items));
+ TidStoreMemoryUsage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2179,7 +2178,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- tidstore_reset(vacrel->dead_items);
+ TidStoreReset(vacrel->dead_items);
return;
}
@@ -2208,7 +2207,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
+ Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2236,7 +2235,7 @@ lazy_vacuum(LVRelState *vacrel)
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
bypass = (vacrel->lpdead_item_pages < threshold) &&
- tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
+ TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2281,7 +2280,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- tidstore_reset(vacrel->dead_items);
+ TidStoreReset(vacrel->dead_items);
}
/*
@@ -2354,7 +2353,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
+ TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2394,7 +2393,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
TidStoreIter *iter;
- TidStoreIterResult *result;
+ TidStoreIterResult *iter_result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2409,8 +2408,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- iter = tidstore_begin_iterate(vacrel->dead_items);
- while ((result = tidstore_iterate_next(iter)) != NULL)
+ iter = TidStoreBeginIterate(vacrel->dead_items);
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2419,7 +2418,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = result->blkno;
+ blkno = iter_result->blkno;
vacrel->blkno = blkno;
/*
@@ -2433,8 +2432,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
- buf, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, iter_result->offsets,
+ iter_result->num_offsets, buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2444,7 +2443,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
- tidstore_end_iterate(iter);
+ TidStoreEndIterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2455,12 +2454,12 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
- (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
+ (TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
- vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ (errmsg("table \"%s\": removed " INT64_FORMAT " dead item identifiers in %u pages",
+ vacrel->relname, TidStoreNumTids(vacrel->dead_items),
vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
@@ -3118,8 +3117,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
- NULL);
+ vacrel->dead_items = TidStoreCreate(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index d8e680ca20..5fb30d7e62 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2311,7 +2311,7 @@ vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
ereport(ivinfo->message_level,
(errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- tidstore_num_tids(dead_items))));
+ TidStoreNumTids(dead_items))));
return istat;
}
@@ -2352,5 +2352,5 @@ vac_tid_reaped(ItemPointer itemptr, void *state)
{
TidStore *dead_items = (TidStore *) state;
- return tidstore_lookup_tid(dead_items, itemptr);
+ return TidStoreIsMember(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index d653683693..9225daf3ab 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,11 +9,12 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the shared TidStore. We launch parallel worker processes at the start of
- * parallel index bulk-deletion and index cleanup and once all indexes are
- * processed, the parallel worker processes exit. Each time we process indexes
- * in parallel, the parallel context is re-initialized so that the same DSM can
- * be used for multiple passes of index bulk-deletion and index cleanup.
+ * the memory space for storing dead items allocated in the DSA area. We
+ * launch parallel worker processes at the start of parallel index
+ * bulk-deletion and index cleanup and once all indexes are processed, the
+ * parallel worker processes exit. Each time we process indexes in parallel,
+ * the parallel context is re-initialized so that the same DSM can be used for
+ * multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -104,7 +105,7 @@ typedef struct PVShared
pg_atomic_uint32 idx;
/* Handle of the shared TidStore */
- tidstore_handle dead_items_handle;
+ TidStoreHandle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -289,7 +290,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ /* Initial size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -362,7 +363,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
LWTRANCHE_PARALLEL_VACUUM_DSA,
pcxt->seg);
- dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ dead_items = TidStoreCreate(vac_work_mem, max_offset, dead_items_dsa);
pvs->dead_items = dead_items;
pvs->dead_items_area = dead_items_dsa;
@@ -375,7 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
- shared->dead_items_handle = tidstore_get_handle(dead_items);
+ shared->dead_items_handle = TidStoreGetHandle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -441,7 +442,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
- tidstore_destroy(pvs->dead_items);
+ TidStoreDestroy(pvs->dead_items);
dsa_detach(pvs->dead_items_area);
DestroyParallelContext(pvs->pcxt);
@@ -999,7 +1000,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Set dead items */
area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
dead_items_area = dsa_attach_in_place(area_space, seg);
- dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
+ dead_items = TidStoreAttach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1045,7 +1046,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
- tidstore_detach(pvs.dead_items);
+ TidStoreDetach(dead_items);
dsa_detach(dead_items_area);
/* Pop the error context stack */
--
2.31.1
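Putting the renamed TidStore calls in this patch together, the caller-side flow during vacuum looks roughly like the following condensed sketch (not the actual vacuumlazy.c code; the block number, the offsets, and the 64MB byte budget are made up for illustration, and error handling is omitted):

#include "postgres.h"
#include "access/htup_details.h"
#include "access/tidstore.h"
#include "storage/itemptr.h"

static void
dead_items_flow_sketch(void)
{
    TidStore   *dead_items;
    TidStoreIter *iter;
    TidStoreIterResult *iter_result;
    OffsetNumber offsets[] = {1, 3, 7}; /* must be sorted in ascending order */
    ItemPointerData tid;

    /* serial vacuum case: no DSA area, so the store is backend-local */
    dead_items = TidStoreCreate(64UL * 1024 * 1024, MaxHeapTuplesPerPage, NULL);

    /* first heap pass: remember the LP_DEAD offsets found on each block */
    TidStoreSetBlockOffsets(dead_items, (BlockNumber) 42, offsets, lengthof(offsets));

    /* index vacuum: the bulk-delete callback does a membership check per TID */
    ItemPointerSet(&tid, 42, 3);
    if (TidStoreIsMember(dead_items, &tid))
    {
        /* this index tuple points to a dead heap TID; delete it */
    }

    /* second heap pass: visit blocks in order and mark their items unused */
    iter = TidStoreBeginIterate(dead_items);
    while ((iter_result = TidStoreIterateNext(iter)) != NULL)
    {
        /* use iter_result->blkno and iter_result->offsets[0..num_offsets) */
    }
    TidStoreEndIterate(iter);

    /* forget everything before resuming the heap scan, then clean up */
    TidStoreReset(dead_items);
    TidStoreDestroy(dead_items);
}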
Attachment: v29-0008-Review-TidStore.patch
From fc373e0312e0b3c30bba8bd54286283542d627a2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 16 Feb 2023 23:45:39 +0900
Subject: [PATCH v29 08/10] Review TidStore.
---
src/backend/access/common/tidstore.c | 340 +++++++++---------
src/include/access/tidstore.h | 37 +-
.../modules/test_tidstore/test_tidstore.c | 68 ++--
3 files changed, 234 insertions(+), 211 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 8c05e60d92..9360520482 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -3,18 +3,19 @@
* tidstore.c
* Tid (ItemPointerData) storage implementation.
*
- * This module provides a in-memory data structure to store Tids (ItemPointer).
- * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
- * stored in the radix tree.
+ * TidStore is an in-memory data structure to store tids (ItemPointerData).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
+ * and stored in the radix tree.
*
- * A TidStore can be shared among parallel worker processes by passing DSA area
- * to tidstore_create(). Other backends can attach to the shared TidStore by
- * tidstore_attach().
+ * TidStore can be shared among parallel worker processes by passing DSA area
+ * to TidStoreCreate(). Other backends can attach to the shared TidStore by
+ * TidStoreAttach().
*
- * Regarding the concurrency, it basically relies on the concurrency support in
- * the radix tree, but we acquires the lock on a TidStore in some cases, for
- * example, when to reset the store and when to access the number tids in the
- * store (num_tids).
+ * Regarding the concurrency support, we use a single LWLock for the TidStore.
+ * The TidStore is exclusively locked when inserting encoded tids to the
+ * radix tree or when resetting itself. When searching on the TidStore or
+ * doing the iteration, it is not locked but the underlying radix tree is
+ * locked in shared mode.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -34,16 +35,18 @@
#include "utils/memutils.h"
/*
- * For encoding purposes, tids are represented as a pair of 64-bit key and
- * 64-bit value. First, we construct 64-bit unsigned integer by combining
- * the block number and the offset number. The number of bits used for the
- * offset number is specified by max_offsets in tidstore_create(). We are
- * frugal with the bits, because smaller keys could help keeping the radix
- * tree shallow.
+ * For encoding purposes, a tid is represented as a pair of 64-bit key and
+ * 64-bit value.
*
- * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
- * the offset number and uses the next 32 bits for the block number. That
- * is, only 41 bits are used:
+ * First, we construct a 64-bit unsigned integer by combining the block
+ * number and the offset number. The number of bits used for the offset number
+ * is specified by max_off in TidStoreCreate(). We are frugal with the bits,
+ * because smaller keys could help keeping the radix tree shallow.
+ *
+ * For example, a heap tid on an 8kB block uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. 9 bits
+ * are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks. That is, only 41 bits are used:
*
* uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
*
@@ -52,30 +55,34 @@
* u = unused bit
* (high on the left, low on the right)
*
- * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
- * on 8kB blocks.
- *
- * The 64-bit value is the bitmap representation of the lowest 6 bits
- * (TIDSTORE_VALUE_NBITS) of the integer, and the rest 35 bits are used
- * as the key:
+ * Then, 64-bit value is the bitmap representation of the lowest 6 bits
+ * (LOWER_OFFSET_NBITS) of the integer, and 64-bit key consists of the
+ * upper 3 bits of the offset number and the block number, 35 bits in
+ * total:
*
* uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
* |----| value
- * |---------------------------------------------| key
+ * |--------------------------------------| key
*
* The maximum height of the radix tree is 5 in this case.
+ *
+ * If the number of bits required for offset numbers fits in LOWER_OFFSET_NBITS,
+ * 64-bit value is the bitmap representation of the offset number, and the
+ * 64-bit key is the block number.
*/
-#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
-#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
+typedef uint64 tidkey;
+typedef uint64 offsetbm;
+#define LOWER_OFFSET_NBITS 6 /* log(sizeof(offsetbm), 2) */
+#define LOWER_OFFSET_MASK ((1 << LOWER_OFFSET_NBITS) - 1)
-/* A magic value used to identify our TidStores. */
+/* A magic value used to identify our TidStore. */
#define TIDSTORE_MAGIC 0x826f6a10
#define RT_PREFIX local_rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
+#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
#define RT_PREFIX shared_rt
@@ -83,7 +90,7 @@
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
+#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
/* The control object for a TidStore */
@@ -94,10 +101,10 @@ typedef struct TidStoreControl
/* These values are never changed after creation */
size_t max_bytes; /* the maximum bytes a TidStore can use */
- int max_offset; /* the maximum offset number */
- int offset_nbits; /* the number of bits required for an offset
- * number */
- int offset_key_nbits; /* the number of bits of an offset number
+ int max_off; /* the maximum offset number */
+ int max_off_nbits; /* the number of bits required for offset
+ * numbers */
+ int upper_off_nbits; /* the number of bits of offset numbers
* used in a key */
/* The below fields are used only in shared case */
@@ -106,7 +113,7 @@ typedef struct TidStoreControl
LWLock lock;
/* handles for TidStore and radix tree */
- tidstore_handle handle;
+ TidStoreHandle handle;
shared_rt_handle tree_handle;
} TidStoreControl;
@@ -147,24 +154,27 @@ typedef struct TidStoreIter
bool finished;
/* save for the next iteration */
- uint64 next_key;
- uint64 next_val;
+ tidkey next_tidkey;
+ offsetbm next_off_bitmap;
- /* output for the caller */
- TidStoreIterResult result;
+ /*
+ * output for the caller. Must be last because variable-size.
+ */
+ TidStoreIterResult output;
} TidStoreIter;
-static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
-static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
-static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
-static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
+static void iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap);
+static inline BlockNumber key_get_blkno(TidStore *ts, tidkey key);
+static inline tidkey encode_blk_off(TidStore *ts, BlockNumber block,
+ OffsetNumber offset, offsetbm *off_bit);
+static inline tidkey encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit);
/*
* Create a TidStore. The returned object is allocated in backend-local memory.
* The radix tree for storage is allocated in DSA area is 'area' is non-NULL.
*/
TidStore *
-tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
{
TidStore *ts;
@@ -176,12 +186,12 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
* Memory consumption depends on the number of stored tids, but also on the
* distribution of them, how the radix tree stores, and the memory management
* that backed the radix tree. The maximum bytes that a TidStore can
- * use is specified by the max_bytes in tidstore_create(). We want the total
+ * use is specified by the max_bytes in TidStoreCreate(). We want the total
* amount of memory consumption by a TidStore not to exceed the max_bytes.
*
* In local TidStore cases, the radix tree uses slab allocators for each kind
* of node class. The most memory consuming case while adding Tids associated
- * with one page (i.e. during tidstore_add_tids()) is that we allocate a new
+ * with one page (i.e. during TidStoreSetBlockOffsets()) is that we allocate a new
* slab block for a new radix tree node, which is approximately 70kB. Therefore,
* we deduct 70kB from the max_bytes.
*
@@ -202,7 +212,7 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
dp = dsa_allocate0(area, sizeof(TidStoreControl));
ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->control->max_bytes = (size_t) (max_bytes * ratio);
ts->area = area;
ts->control->magic = TIDSTORE_MAGIC;
@@ -218,14 +228,14 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
ts->control->max_bytes = max_bytes - (70 * 1024);
}
- ts->control->max_offset = max_offset;
- ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+ ts->control->max_off = max_off;
+ ts->control->max_off_nbits = pg_ceil_log2_32(max_off);
- if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
- ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+ if (ts->control->max_off_nbits < LOWER_OFFSET_NBITS)
+ ts->control->max_off_nbits = LOWER_OFFSET_NBITS;
- ts->control->offset_key_nbits =
- ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+ ts->control->upper_off_nbits =
+ ts->control->max_off_nbits - LOWER_OFFSET_NBITS;
return ts;
}
@@ -235,7 +245,7 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
* allocated in backend-local memory using the CurrentMemoryContext.
*/
TidStore *
-tidstore_attach(dsa_area *area, tidstore_handle handle)
+TidStoreAttach(dsa_area *area, TidStoreHandle handle)
{
TidStore *ts;
dsa_pointer control;
@@ -266,7 +276,7 @@ tidstore_attach(dsa_area *area, tidstore_handle handle)
* to the operating system.
*/
void
-tidstore_detach(TidStore *ts)
+TidStoreDetach(TidStore *ts)
{
Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
@@ -279,12 +289,12 @@ tidstore_detach(TidStore *ts)
*
* TODO: The caller must be certain that no other backend will attempt to
* access the TidStore before calling this function. Other backend must
- * explicitly call tidstore_detach to free up backend-local memory associated
- * with the TidStore. The backend that calls tidstore_destroy must not call
- * tidstore_detach.
+ * explicitly call TidStoreDetach() to free up backend-local memory associated
+ * with the TidStore. The backend that calls TidStoreDestroy() must not call
+ * TidStoreDetach().
*/
void
-tidstore_destroy(TidStore *ts)
+TidStoreDestroy(TidStore *ts)
{
if (TidStoreIsShared(ts))
{
@@ -309,11 +319,11 @@ tidstore_destroy(TidStore *ts)
}
/*
- * Forget all collected Tids. It's similar to tidstore_destroy but we don't free
+ * Forget all collected Tids. It's similar to TidStoreDestroy() but we don't free
* entire TidStore but recreate only the radix tree storage.
*/
void
-tidstore_reset(TidStore *ts)
+TidStoreReset(TidStore *ts)
{
if (TidStoreIsShared(ts))
{
@@ -350,30 +360,34 @@ tidstore_reset(TidStore *ts)
}
}
-/* Add Tids on a block to TidStore */
+/*
+ * Set the given tids on the given block in the TidStore.
+ *
+ * NB: the offset numbers in offsets must be sorted in ascending order.
+ */
void
-tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
- int num_offsets)
+TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
{
- uint64 *values;
- uint64 key;
- uint64 prev_key;
- uint64 off_bitmap = 0;
+ offsetbm *bitmaps;
+ tidkey key;
+ tidkey prev_key;
+ offsetbm off_bitmap = 0;
int idx;
- const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
- const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+ const tidkey key_base = ((uint64) blkno) << ts->control->upper_off_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->upper_off_nbits;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- values = palloc(sizeof(uint64) * nkeys);
+ bitmaps = palloc(sizeof(offsetbm) * nkeys);
key = prev_key = key_base;
for (int i = 0; i < num_offsets; i++)
{
- uint64 off_bit;
+ offsetbm off_bit;
/* encode the tid to a key and partial offset */
- key = encode_key_off(ts, blkno, offsets[i], &off_bit);
+ key = encode_blk_off(ts, blkno, offsets[i], &off_bit);
/* make sure we scanned the line pointer array in order */
Assert(key >= prev_key);
@@ -384,11 +398,11 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
Assert(idx >= 0 && idx < nkeys);
/* write out offset bitmap for this key */
- values[idx] = off_bitmap;
+ bitmaps[idx] = off_bitmap;
/* zero out any gaps up to the current key */
for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
- values[empty_idx] = 0;
+ bitmaps[empty_idx] = 0;
/* reset for current key -- the current offset will be handled below */
off_bitmap = 0;
@@ -401,7 +415,7 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
/* save the final index for later */
idx = key - key_base;
/* write out last offset bitmap */
- values[idx] = off_bitmap;
+ bitmaps[idx] = off_bitmap;
if (TidStoreIsShared(ts))
LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
@@ -409,14 +423,14 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
/* insert the calculated key-values to the tree */
for (int i = 0; i <= idx; i++)
{
- if (values[i])
+ if (bitmaps[i])
{
key = key_base + i;
if (TidStoreIsShared(ts))
- shared_rt_set(ts->tree.shared, key, &values[i]);
+ shared_rt_set(ts->tree.shared, key, &bitmaps[i]);
else
- local_rt_set(ts->tree.local, key, &values[i]);
+ local_rt_set(ts->tree.local, key, &bitmaps[i]);
}
}
@@ -426,70 +440,70 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
if (TidStoreIsShared(ts))
LWLockRelease(&ts->control->lock);
- pfree(values);
+ pfree(bitmaps);
}
/* Return true if the given tid is present in the TidStore */
bool
-tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+TidStoreIsMember(TidStore *ts, ItemPointer tid)
{
- uint64 key;
- uint64 val = 0;
- uint64 off_bit;
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ offsetbm off_bit;
bool found;
- key = tid_to_key_off(ts, tid, &off_bit);
+ key = encode_tid(ts, tid, &off_bit);
if (TidStoreIsShared(ts))
- found = shared_rt_search(ts->tree.shared, key, &val);
+ found = shared_rt_search(ts->tree.shared, key, &off_bitmap);
else
- found = local_rt_search(ts->tree.local, key, &val);
+ found = local_rt_search(ts->tree.local, key, &off_bitmap);
if (!found)
return false;
- return (val & off_bit) != 0;
+ return (off_bitmap & off_bit) != 0;
}
/*
- * Prepare to iterate through a TidStore. Since the radix tree is locked during the
- * iteration, so tidstore_end_iterate() needs to called when finished.
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, so TidStoreEndIterate() needs to be called when finished.
+ *
+ * The TidStoreIter struct is created in the caller's memory context.
*
* Concurrent updates during the iteration will be blocked when inserting a
* key-value to the radix tree.
*/
TidStoreIter *
-tidstore_begin_iterate(TidStore *ts)
+TidStoreBeginIterate(TidStore *ts)
{
TidStoreIter *iter;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- iter = palloc0(sizeof(TidStoreIter));
+ iter = palloc0(sizeof(TidStoreIter) +
+ sizeof(OffsetNumber) * ts->control->max_off);
iter->ts = ts;
- iter->result.blkno = InvalidBlockNumber;
- iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
-
if (TidStoreIsShared(ts))
iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
else
iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
/* If the TidStore is empty, there is no business */
- if (tidstore_num_tids(ts) == 0)
+ if (TidStoreNumTids(ts) == 0)
iter->finished = true;
return iter;
}
static inline bool
-tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+tidstore_iter(TidStoreIter *iter, tidkey *key, offsetbm *off_bitmap)
{
if (TidStoreIsShared(iter->ts))
- return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, off_bitmap);
- return local_rt_iterate_next(iter->tree_iter.local, key, val);
+ return local_rt_iterate_next(iter->tree_iter.local, key, off_bitmap);
}
/*
@@ -498,45 +512,48 @@ tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
* numbers in each result is also sorted in ascending order.
*/
TidStoreIterResult *
-tidstore_iterate_next(TidStoreIter *iter)
+TidStoreIterateNext(TidStoreIter *iter)
{
- uint64 key;
- uint64 val;
- TidStoreIterResult *result = &(iter->result);
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ TidStoreIterResult *output = &(iter->output);
if (iter->finished)
return NULL;
- if (BlockNumberIsValid(result->blkno))
- {
- /* Process the previously collected key-value */
- result->num_offsets = 0;
- tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
- }
+ /* Initialize the outputs */
+ output->blkno = InvalidBlockNumber;
+ output->num_offsets = 0;
- while (tidstore_iter_kv(iter, &key, &val))
- {
- BlockNumber blkno;
+ /*
+ * Decode the key and offset bitmap collected in the previous
+ * iteration, if any.
+ */
+ if (iter->next_off_bitmap > 0)
+ iter_decode_key_off(iter, iter->next_tidkey, iter->next_off_bitmap);
- blkno = key_get_blkno(iter->ts, key);
+ while (tidstore_iter(iter, &key, &off_bitmap))
+ {
+ BlockNumber blkno = key_get_blkno(iter->ts, key);
- if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ if (BlockNumberIsValid(output->blkno) && output->blkno != blkno)
{
/*
- * We got a key-value pair for a different block. So return the
- * collected tids, and remember the key-value for the next iteration.
+ * We got tids for a different block. We return the collected
+ * tids so far, and remember the key-value for the next
+ * iteration.
*/
- iter->next_key = key;
- iter->next_val = val;
- return result;
+ iter->next_tidkey = key;
+ iter->next_off_bitmap = off_bitmap;
+ return output;
}
- /* Collect tids extracted from the key-value pair */
- tidstore_iter_extract_tids(iter, key, val);
+ /* Collect tids decoded from the key and offset bitmap */
+ iter_decode_key_off(iter, key, off_bitmap);
}
iter->finished = true;
- return result;
+ return output;
}
/*
@@ -544,22 +561,21 @@ tidstore_iterate_next(TidStoreIter *iter)
* or when existing an iteration.
*/
void
-tidstore_end_iterate(TidStoreIter *iter)
+TidStoreEndIterate(TidStoreIter *iter)
{
if (TidStoreIsShared(iter->ts))
shared_rt_end_iterate(iter->tree_iter.shared);
else
local_rt_end_iterate(iter->tree_iter.local);
- pfree(iter->result.offsets);
pfree(iter);
}
/* Return the number of tids we collected so far */
int64
-tidstore_num_tids(TidStore *ts)
+TidStoreNumTids(TidStore *ts)
{
- uint64 num_tids;
+ int64 num_tids;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -575,16 +591,16 @@ tidstore_num_tids(TidStore *ts)
/* Return true if the current memory usage of TidStore exceeds the limit */
bool
-tidstore_is_full(TidStore *ts)
+TidStoreIsFull(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+ return (TidStoreMemoryUsage(ts) > ts->control->max_bytes);
}
/* Return the maximum memory TidStore can use */
size_t
-tidstore_max_memory(TidStore *ts)
+TidStoreMaxMemory(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -593,7 +609,7 @@ tidstore_max_memory(TidStore *ts)
/* Return the memory usage of TidStore */
size_t
-tidstore_memory_usage(TidStore *ts)
+TidStoreMemoryUsage(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -611,71 +627,75 @@ tidstore_memory_usage(TidStore *ts)
/*
* Get a handle that can be used by other processes to attach to this TidStore
*/
-tidstore_handle
-tidstore_get_handle(TidStore *ts)
+TidStoreHandle
+TidStoreGetHandle(TidStore *ts)
{
Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
return ts->control->handle;
}
-/* Extract tids from the given key-value pair */
+/*
+ * Decode the key and offset bitmap into tids and store them in the iteration
+ * result.
+ */
static void
-tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap)
{
- TidStoreIterResult *result = (&iter->result);
+ TidStoreIterResult *output = (&iter->output);
- while (val)
+ while (off_bitmap)
{
- uint64 tid_i;
+ uint64 compressed_tid;
OffsetNumber off;
- tid_i = key << TIDSTORE_VALUE_NBITS;
- tid_i |= pg_rightmost_one_pos64(val);
+ compressed_tid = key << LOWER_OFFSET_NBITS;
+ compressed_tid |= pg_rightmost_one_pos64(off_bitmap);
- off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+ off = compressed_tid & ((UINT64CONST(1) << iter->ts->control->max_off_nbits) - 1);
- Assert(result->num_offsets < iter->ts->control->max_offset);
- result->offsets[result->num_offsets++] = off;
+ Assert(output->num_offsets < iter->ts->control->max_off);
+ output->offsets[output->num_offsets++] = off;
/* unset the rightmost bit */
- val &= ~pg_rightmost_one64(val);
+ off_bitmap &= ~pg_rightmost_one64(off_bitmap);
}
- result->blkno = key_get_blkno(iter->ts, key);
+ output->blkno = key_get_blkno(iter->ts, key);
}
/* Get block number from the given key */
static inline BlockNumber
-key_get_blkno(TidStore *ts, uint64 key)
+key_get_blkno(TidStore *ts, tidkey key)
{
- return (BlockNumber) (key >> ts->control->offset_key_nbits);
+ return (BlockNumber) (key >> ts->control->upper_off_nbits);
}
-/* Encode a tid to key and offset */
-static inline uint64
-tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
+/* Encode a tid to key and partial offset */
+static inline tidkey
+encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit)
{
- uint32 offset = ItemPointerGetOffsetNumber(tid);
+ OffsetNumber offset = ItemPointerGetOffsetNumber(tid);
BlockNumber block = ItemPointerGetBlockNumber(tid);
- return encode_key_off(ts, block, offset, off_bit);
+ return encode_blk_off(ts, block, offset, off_bit);
}
/* encode a block and offset to a key and partial offset */
-static inline uint64
-encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
+static inline tidkey
+encode_blk_off(TidStore *ts, BlockNumber block, OffsetNumber offset,
+ offsetbm *off_bit)
{
- uint64 key;
- uint64 tid_i;
+ tidkey key;
+ uint64 compressed_tid;
uint32 off_lower;
- off_lower = offset & TIDSTORE_OFFSET_MASK;
- Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
+ off_lower = offset & LOWER_OFFSET_MASK;
+ Assert(off_lower < (sizeof(offsetbm) * BITS_PER_BYTE));
*off_bit = UINT64CONST(1) << off_lower;
- tid_i = offset | ((uint64) block << ts->control->offset_nbits);
- key = tid_i >> TIDSTORE_VALUE_NBITS;
+ compressed_tid = offset | ((uint64) block << ts->control->max_off_nbits);
+ key = compressed_tid >> LOWER_OFFSET_NBITS;
return key;
}
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index a35a52124a..66f0fdd482 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -17,33 +17,34 @@
#include "storage/itemptr.h"
#include "utils/dsa.h"
-typedef dsa_pointer tidstore_handle;
+typedef dsa_pointer TidStoreHandle;
typedef struct TidStore TidStore;
typedef struct TidStoreIter TidStoreIter;
+/* Result struct for TidStoreIterateNext */
typedef struct TidStoreIterResult
{
BlockNumber blkno;
- OffsetNumber *offsets;
int num_offsets;
+ OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
} TidStoreIterResult;
-extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
-extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
-extern void tidstore_detach(TidStore *ts);
-extern void tidstore_destroy(TidStore *ts);
-extern void tidstore_reset(TidStore *ts);
-extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
- int num_offsets);
-extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
-extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
-extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
-extern void tidstore_end_iterate(TidStoreIter *iter);
-extern int64 tidstore_num_tids(TidStore *ts);
-extern bool tidstore_is_full(TidStore *ts);
-extern size_t tidstore_max_memory(TidStore *ts);
-extern size_t tidstore_memory_usage(TidStore *ts);
-extern tidstore_handle tidstore_get_handle(TidStore *ts);
+extern TidStore *TidStoreCreate(size_t max_bytes, int max_off, dsa_area *dsa);
+extern TidStore *TidStoreAttach(dsa_area *dsa, dsa_pointer handle);
+extern void TidStoreDetach(TidStore *ts);
+extern void TidStoreDestroy(TidStore *ts);
+extern void TidStoreReset(TidStore *ts);
+extern void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool TidStoreIsMember(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * TidStoreBeginIterate(TidStore *ts);
+extern TidStoreIterResult *TidStoreIterateNext(TidStoreIter *iter);
+extern void TidStoreEndIterate(TidStoreIter *iter);
+extern int64 TidStoreNumTids(TidStore *ts);
+extern bool TidStoreIsFull(TidStore *ts);
+extern size_t TidStoreMaxMemory(TidStore *ts);
+extern size_t TidStoreMemoryUsage(TidStore *ts);
+extern TidStoreHandle TidStoreGetHandle(TidStore *ts);
#endif /* TIDSTORE_H */
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
index 9a1217f833..8659e6780e 100644
--- a/src/test/modules/test_tidstore/test_tidstore.c
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -37,10 +37,10 @@ check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
ItemPointerSet(&tid, blkno, off);
- found = tidstore_lookup_tid(ts, &tid);
+ found = TidStoreIsMember(ts, &tid);
if (found != expect)
- elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ elog(ERROR, "TidStoreIsMember for TID (%u, %u) returned %d, expected %d",
blkno, off, found, expect);
}
@@ -69,9 +69,9 @@ test_basic(int max_offset)
LWLockRegisterTranche(tranche_id, "test_tidstore");
dsa = dsa_create(tranche_id);
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
#else
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
#endif
/* prepare the offset array */
@@ -83,7 +83,7 @@ test_basic(int max_offset)
/* add tids */
for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
- tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+ TidStoreSetBlockOffsets(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
/* lookup test */
for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
@@ -105,30 +105,30 @@ test_basic(int max_offset)
}
/* test the number of tids */
- if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
- elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
- tidstore_num_tids(ts),
+ if (TidStoreNumTids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "TidStoreNumTids returned " UINT64_FORMAT ", expected %d",
+ TidStoreNumTids(ts),
TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
/* iteration test */
- iter = tidstore_begin_iterate(ts);
+ iter = TidStoreBeginIterate(ts);
blk_idx = 0;
- while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
/* check the returned block number */
if (blks_sorted[blk_idx] != iter_result->blkno)
- elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ elog(ERROR, "TidStoreIterateNext returned block number %u, expected %u",
iter_result->blkno, blks_sorted[blk_idx]);
/* check the returned offset numbers */
if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
- elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ elog(ERROR, "TidStoreIterateNext %u offsets, expected %u",
iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
for (int i = 0; i < iter_result->num_offsets; i++)
{
if (offs[i] != iter_result->offsets[i])
- elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ elog(ERROR, "TidStoreIterateNext offset number %u on block %u, expected %u",
iter_result->offsets[i], iter_result->blkno, offs[i]);
}
@@ -136,15 +136,15 @@ test_basic(int max_offset)
}
if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
- elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ elog(ERROR, "TidStoreIterateNext returned %d blocks, expected %d",
blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
/* remove all tids */
- tidstore_reset(ts);
+ TidStoreReset(ts);
/* test the number of tids */
- if (tidstore_num_tids(ts) != 0)
- elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
/* lookup test for empty store */
for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
@@ -156,7 +156,7 @@ test_basic(int max_offset)
check_tid(ts, MaxBlockNumber, off, false);
}
- tidstore_destroy(ts);
+ TidStoreDestroy(ts);
#ifdef TEST_SHARED_TIDSTORE
dsa_detach(dsa);
@@ -177,36 +177,37 @@ test_empty(void)
LWLockRegisterTranche(tranche_id, "test_tidstore");
dsa = dsa_create(tranche_id);
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
#else
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
#endif
elog(NOTICE, "testing empty tidstore");
ItemPointerSet(&tid, 0, FirstOffsetNumber);
- if (tidstore_lookup_tid(ts, &tid))
- elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
+ 0, FirstOffsetNumber);
ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
- if (tidstore_lookup_tid(ts, &tid))
- elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
MaxBlockNumber, MaxOffsetNumber);
- if (tidstore_num_tids(ts) != 0)
- elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
- if (tidstore_is_full(ts))
- elog(ERROR, "tidstore_is_full on empty store returned true");
+ if (TidStoreIsFull(ts))
+ elog(ERROR, "TidStoreIsFull on empty store returned true");
- iter = tidstore_begin_iterate(ts);
+ iter = TidStoreBeginIterate(ts);
- if (tidstore_iterate_next(iter) != NULL)
- elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+ if (TidStoreIterateNext(iter) != NULL)
+ elog(ERROR, "TidStoreIterateNext on empty store returned TIDs");
- tidstore_end_iterate(iter);
+ TidStoreEndIterate(iter);
- tidstore_destroy(ts);
+ TidStoreDestroy(ts);
#ifdef TEST_SHARED_TIDSTORE
dsa_detach(dsa);
@@ -221,6 +222,7 @@ test_tidstore(PG_FUNCTION_ARGS)
elog(NOTICE, "testing basic operations");
test_basic(MaxHeapTuplesPerPage);
test_basic(10);
+ test_basic(MaxHeapTuplesPerPage * 2);
PG_RETURN_VOID();
}
--
2.31.1
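For reviewers, here is a minimal (untested) sketch of how a caller would use the renamed TidStore API, based only on the declarations in tidstore.h above. The memory limit, block number, and offsets are made up for illustration, and the function name is hypothetical, not part of the patch:

#include "postgres.h"

#include "access/htup_details.h"
#include "access/tidstore.h"
#include "storage/itemptr.h"

/* Hypothetical caller, for illustration only */
static void
tidstore_usage_sketch(void)
{
	OffsetNumber offs[] = {1, 2, 5};
	TidStore   *ts;
	TidStoreIter *iter;
	TidStoreIterResult *result;
	ItemPointerData tid;

	/* backend-local store; pass a dsa_area instead of NULL for the shared case */
	ts = TidStoreCreate(64 * 1024 * 1024, MaxHeapTuplesPerPage, NULL);

	/* record dead item offsets for one heap block */
	TidStoreSetBlockOffsets(ts, (BlockNumber) 10, offs, lengthof(offs));

	/* membership check, as lazy_tid_reaped() would do per index tuple */
	ItemPointerSet(&tid, 10, 2);
	Assert(TidStoreIsMember(ts, &tid));

	/* iterate in block order, e.g. for the second heap pass */
	iter = TidStoreBeginIterate(ts);
	while ((result = TidStoreIterateNext(iter)) != NULL)
	{
		for (int i = 0; i < result->num_offsets; i++)
		{
			/* process (result->blkno, result->offsets[i]) */
		}
	}
	TidStoreEndIterate(iter);

	TidStoreDestroy(ts);
}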
Attachment: v29-0005-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch (application/octet-stream)
From 848d68ee7c484a7041c6d0d703304cadfdfc36a2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v29 05/10] Tool for measuring radix tree and tidstore
performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 88 +++
contrib/bench_radix_tree/bench_radix_tree.c | 747 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 925 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..ad66265e23
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,88 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_tidstore_load(
+minblk int4,
+maxblk int4,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT iter_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..6e5149e2c4
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,747 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+//#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+PG_FUNCTION_INFO_V1(bench_tidstore_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* fixed seed for reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+Datum
+bench_tidstore_load(PG_FUNCTION_ARGS)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
+ OffsetNumber *offs;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_ms;
+ int64 iter_ms;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3] = {false};
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ offs = palloc(sizeof(OffsetNumber) * TIDS_PER_BLOCK_FOR_LOAD);
+ for (int i = 0; i < TIDS_PER_BLOCK_FOR_LOAD; i++)
+ offs[i] = i + 1; /* FirstOffsetNumber is 1 */
+
+ ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
+
+ /* load tids */
+ start_time = GetCurrentTimestamp();
+ for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
+ tidstore_add_tids(ts, blkno, offs, TIDS_PER_BLOCK_FOR_LOAD);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* iterate through tids */
+ iter = tidstore_begin_iterate(ts);
+ start_time = GetCurrentTimestamp();
+ while ((result = tidstore_iterate_next(iter)) != NULL)
+ ;
+ tidstore_end_iterate(iter);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ iter_ms = secs * 1000 + usecs / 1000;
+
+ values[0] = Int64GetDatum(tidstore_memory_usage(ts));
+ values[1] = Int64GetDatum(load_ms);
+ values[2] = Int64GetDatum(iter_ms);
+
+ tidstore_destroy(ts);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ int64 search_time_ms;
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* to silence warnings about unused iter functions */
+static void pg_attribute_unused()
+stub_iter()
+{
+ rt_radix_tree *rt;
+ rt_iter *iter;
+ uint64 key = 1;
+ uint64 value = 1;
+
+ rt = rt_create(CurrentMemoryContext);
+
+ iter = rt_begin_iterate(rt);
+ rt_iterate_next(iter, &key, &value);
+ rt_end_iterate(iter);
+}
\ No newline at end of file
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..421d469f8c 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
Attachment: v29-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch (application/octet-stream)
From 39f0e713854942fbad3678bce9138adea546f1be Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v29 02/10] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 3d2225e1ae..5f9a511b4a 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index a49a9c03d9..7235ad25ee 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 45fc5759ce..f95d3dfd69 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3670,7 +3670,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
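As a small illustration of the helpers involved here (the function name and values are made up, not part of the patch), this is the pattern iter_decode_key_off() in the tidstore patch uses to walk the set bits of an offset bitmap:

#include "postgres.h"

#include "port/pg_bitutils.h"

/* Hypothetical example; bits 2, 3 and 5 are set in the starting word. */
static void
rightmost_one_sketch(void)
{
	uint64		bitmap = UINT64CONST(0x2C);

	while (bitmap != 0)
	{
		/* pg_rightmost_one_pos64() yields 2, then 3, then 5 */
		int			pos = pg_rightmost_one_pos64(bitmap);

		(void) pos;

		/* clear the bit just consumed, using the newly exported helper */
		bitmap &= ~pg_rightmost_one64(bitmap);
	}
}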
Attachment: v29-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch (application/octet-stream)
From 58f8e2b82eb196d463114d8ec3dad343b2b027e0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v29 01/10] Introduce helper SIMD functions for small byte
arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 1fa6c3bc6c..dfae14e463 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+ * Note: There is a faster way to do this, but it returns a uint64,
+ * and if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
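To show how these helpers are meant to be combined (this roughly mirrors the chunk-array searches in the radix tree template below, but the function here is only an illustration and assumes a SIMD-capable build), the new vector8_highbit_mask() turns a byte-wise vector8_eq() result into a bitmask whose rightmost set bit gives the index of the first match; vector8_min() plus vector8_eq() can be used in the same way to get per-element ordered comparisons, e.g. for finding an insert position:

#include "postgres.h"

#include "port/pg_bitutils.h"
#include "port/simd.h"

#ifndef USE_NO_SIMD
/*
 * Hypothetical example: return the index of the first byte equal to "key"
 * among the first sizeof(Vector8) bytes of "chunks", or -1 if none matches.
 */
static int
chunk_search_eq_sketch(const uint8 *chunks, uint8 key)
{
	Vector8		spread_key = vector8_broadcast(key);
	Vector8		haystack;
	uint32		bitfield;

	vector8_load(&haystack, chunks);
	bitfield = vector8_highbit_mask(vector8_eq(haystack, spread_key));

	if (bitfield == 0)
		return -1;

	return pg_rightmost_one_pos32(bitfield);
}
#endif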
Attachment: v29-0003-Add-radixtree-template.patch (application/octet-stream)
From ab33774676db3e419dd56b2001f0cbf2bc291d3d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v29 03/10] Add radixtree template
WIP: commit message based on template comments
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2516 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 328 +++
src/include/lib/radixtree_iter_impl.h | 153 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 681 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4089 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..e546bd705c
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2516 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves"
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
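+
+/*
+ * For illustration: with RT_NODE_SPAN of 8, the key 0x000000ABCD000000
+ * decomposes into the chunk 0xAB at shift 32 and the chunk 0xCD at shift 24,
+ * with all other chunks being zero.
+ */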
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256, is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
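+
+/*
+ * For example, assuming SLAB_DEFAULT_BLOCK_SIZE is 8kB, a 40-byte node size
+ * yields Max((8192 / 40) * 40, 40 * 32) = Max(8160, 1280) = 8160 bytes,
+ * i.e. the largest multiple of the chunk size that fits in the default block
+ * size, while larger chunk sizes get a block big enough for 32 of them.
+ */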
+
+/* Common type for all node kinds */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner nodes (shift > 0) store pointers to
+ * their child nodes in the slots, while in leaf nodes (shift == 0) the
+ * slots contain the values corresponding to the keys.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base types of each node kind, shared by the leaf and inner variants.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses a "slot_idxs" array, indexed by key chunk, to store
+ * indexes into a second array that contains the values (or child
+ * pointers).
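+ *
+ * For example (illustration only): if the chunk 0x41 is stored in slot 5,
+ * then slot_idxs[0x41] == 5 and values[5] (or children[5]) holds the
+ * corresponding value (or child pointer).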
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+	/* For each chunk, the index into the slot array, or RT_INVALID_SLOT_IDX */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might differ in size from
+ * a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+/* WIP: this name is a bit strange */
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each key-value pair in ascending key
+ * order. To support this, we iterate over the nodes at each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to iterate at a time. Therefore,
+ * RT_NODE_ITER has local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to prevent other processes from beginning an
+ * iteration while one is in progress, or support for concurrent iteration.
+ */
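+
+/*
+ * A typical iteration loop, using the hypothetical 'rt' prefix and uint64
+ * value type from the usage sketch above (illustration only):
+ *
+ *     uint64      key;
+ *     uint64      value;
+ *     rt_iter    *iter = rt_begin_iterate(tree);
+ *
+ *     while (rt_iterate_next(iter, &key, &value))
+ *     {
+ *         ... do something with key and value ...
+ *     }
+ *     rt_end_iterate(iter);
+ */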
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertions enabled) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that allows storing the given key.
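+ *
+ * For example, a key of 0x100 has its leftmost set bit at position 8, so
+ * this returns (8 / RT_NODE_SPAN) * RT_NODE_SPAN = 8.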
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node with the given node kind.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static pg_noinline void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static inline void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static pg_noinline void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+		allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+		node = RT_PTR_GET_LOCAL(tree, allocnode);
+		RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have the inner and leaf nodes for the given
+ * key-value pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static pg_noinline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is copied into *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static void
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+	/* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set key to the value pointed to by value_p. If the entry already exists,
+ * its value is updated and true is returned. Returns false if the entry
+ * didn't yet exist and was newly created.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is
+ * found, otherwise return false. On success, the value is copied into
+ * *value_p, so it must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+	/* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+	/* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* no key is found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+	/* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and copy the
+ * value into *value_p, otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+		/* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+	/* empty tree */
+	if (!iter->tree->ctl->root)
+	{
+		MemoryContextSwitchTo(old_ctx);
+		return iter;
+	}
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+	 * Descend to the leftmost leaf node from the root. The key is constructed
+	 * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true, setting *key_p and *value_p, if there is a next key. Otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+		 * We've visited all values in the leaf node, so advance the inner node
+		 * iterators, starting from level 1, until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function needs to be called after finishing or when exiting an
+ * iteration.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+				/* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+					for (int i = 0; i < (RT_NODE_MAX_SLOTS / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+			/* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s",buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
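+
+/*
+ * For illustration: a caller instantiates this template the way the test
+ * module test_radixtree.c does, e.g.
+ *
+ *     #define RT_PREFIX rt
+ *     #define RT_SCOPE static
+ *     #define RT_DECLARE
+ *     #define RT_DEFINE
+ *     #define RT_VALUE_TYPE uint64
+ *     #include "lib/radixtree.h"
+ *
+ * which is why these parameters must be undefined here before another radix
+ * tree can be defined in the same translation unit.
+ */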
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
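+
+/*
+ * For illustration only (a sketch, not verbatim from radixtree.h): this
+ * fragment is intended to be included inside a function body in radixtree.h
+ * that has 'node' and 'key' in scope and returns bool, roughly like
+ *
+ *     #define RT_NODE_LEVEL_LEAF
+ *     #include "lib/radixtree_delete_impl.h"
+ *     #undef RT_NODE_LEVEL_LEAF
+ *
+ * with a corresponding RT_NODE_LEVEL_INNER inclusion for inner nodes.
+ */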
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..d56e58dcac
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,328 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ bool chunk_exists = false;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n3->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos;
+ int cnt = 0;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n125->values[slotpos] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!chunk_exists)
+ node->count++;
+#else
+ node->count++;
+#endif
+
+ /*
+ * Done. Finally, verify that the chunk and value are inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return chunk_exists;
+#else
+ return;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..98c78eb237
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,153 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..afe53382f3
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,681 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+/* #define RT_SHMEM */
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType*) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index e52fe9f509..09fa6e7432 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.31.1
Attachment: v29-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch (application/octet-stream)
From 7a4bf52d585e41926b6a85cb7ae64be177cc0d04 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v29 04/10] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and a
64-bit value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 681 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 226 ++++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1057 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dca50707ad..e28206e056 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2198,6 +2198,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..8c05e60d92
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,681 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store tids
+ * (ItemPointer). Internally, a tid is encoded as a pair of a 64-bit key and
+ * a 64-bit value, and stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * with tidstore_attach().
+ *
+ * As for concurrency, we mostly rely on the concurrency support in the radix
+ * tree, but we acquire the lock on a TidStore in some cases, for example,
+ * when resetting the store and when accessing the number of tids in the
+ * store (num_tids).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
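+
+/*
+ * Example usage, a minimal sketch for the local (non-shared) case; per-block
+ * additions go through tidstore_add_tids(), and iteration uses the
+ * TidStoreIter interface defined below:
+ *
+ *     TidStore *ts = tidstore_create(max_bytes, MaxHeapTuplesPerPage, NULL);
+ *     ... collect the dead tids of each block and add them with
+ *     tidstore_add_tids() ...
+ *     ... look up or iterate over the stored tids, then free the store ...
+ */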
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of a 64-bit key and
+ * a 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is determined by max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys help keep the radix tree
+ * shallow.
+ *
+ * For example, a tid in a heap with 8kB blocks uses the lowest 9 bits for
+ * the offset number and the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ */
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
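+
+/*
+ * Worked example (illustration only, assuming 8kB heap blocks and hence
+ * 9 offset bits): the tid (block 10, offset 20) becomes the integer
+ * (10 << 9) | 20 = 5140. The radix tree key is 5140 >> TIDSTORE_VALUE_NBITS
+ * = 80, and the value is a bitmap word with bit (5140 & TIDSTORE_OFFSET_MASK)
+ * = 20 set, i.e. UINT64CONST(1) << 20. Tids on the same block whose encoded
+ * integers fall into the same group of 64 share one key, so their bits are
+ * OR'ed into the same value word.
+ */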
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for an offset
+ * number */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used in a key */
+
+ /* The below fields are used only in the shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* have we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends not only on the number of stored tids, but also
+ * on their distribution, how the radix tree stores them, and the memory
+ * management that backs the radix tree. The maximum number of bytes that a
+ * TidStore can use is specified by max_bytes in tidstore_create(). We want the
+ * total memory consumption of a TidStore not to exceed max_bytes.
+ *
+ * In local TidStore cases, the radix tree uses slab allocators for each kind
+ * of node class. The most memory-consuming case while adding Tids associated
+ * with one page (i.e. during tidstore_add_tids()) is allocating a new
+ * slab block for a new radix tree node, which is approximately 70kB. Therefore,
+ * we deduct 70kB from max_bytes.
+ *
+ * In shared cases, DSA allocates memory segments big enough to follow a
+ * geometric series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases segment size,
+ * and the simulation revealed that a 75% threshold for the maximum bytes
+ * works well when max_bytes is a power of 2, and a 60% threshold works for
+ * other cases.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
+ ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming error where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected Tids. It is similar to tidstore_destroy, but instead of
+ * freeing the entire TidStore we recreate only the radix tree storage.
+ */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 *values;
+ uint64 key;
+ uint64 prev_key;
+ uint64 off_bitmap = 0;
+ int idx;
+ const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ values = palloc(sizeof(uint64) * nkeys);
+ key = prev_key = key_base;
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 off_bit;
+
+ /* encode the tid to a key and partial offset */
+ key = encode_key_off(ts, blkno, offsets[i], &off_bit);
+
+ /* make sure we scanned the line pointer array in order */
+ Assert(key >= prev_key);
+
+ if (key > prev_key)
+ {
+ idx = prev_key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ /* write out offset bitmap for this key */
+ values[idx] = off_bitmap;
+
+ /* zero out any gaps up to the current key */
+ for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
+ values[empty_idx] = 0;
+
+ /* reset for current key -- the current offset will be handled below */
+ off_bitmap = 0;
+ prev_key = key;
+ }
+
+ off_bitmap |= off_bit;
+ }
+
+ /* save the final index for later */
+ idx = key - key_base;
+ /* write out last offset bitmap */
+ values[idx] = off_bitmap;
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i <= idx; i++)
+ {
+ if (values[i])
+ {
+ key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &values[i]);
+ else
+ local_rt_set(ts->tree.local, key, &values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint64 off_bit;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off_bit);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & off_bit) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value to the radix tree.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to do */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a pointer to a TidStoreIterResult that has the
+ * tids in one block. We return the block numbers in ascending order, and the
+ * offset numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ /* Process the previously collected key-value */
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/*
+ * Finish an iteration over a TidStore. This needs to be called after finishing
+ * the iteration, or when exiting it early.
+ */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ while (val)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= pg_rightmost_one_pos64(val);
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+
+ /* unset the rightmost bit */
+ val &= ~pg_rightmost_one64(val);
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
+{
+ uint32 offset = ItemPointerGetOffsetNumber(tid);
+ BlockNumber block = ItemPointerGetBlockNumber(tid);
+
+ return encode_key_off(ts, block, offset, off_bit);
+}
+
+/* encode a block and offset to a key and partial offset */
+static inline uint64
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
+{
+ uint64 key;
+ uint64 tid_i;
+ uint32 off_lower;
+
+ off_lower = offset & TIDSTORE_OFFSET_MASK;
+ Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
+
+ *off_bit = UINT64CONST(1) << off_lower;
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9a1217f833
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,226 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+/* #define TEST_SHARED_TIDSTORE 1 */
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+#else
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+#endif
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+#else
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+#endif
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Feb 16, 2023 at 6:23 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 14, 2023 at 8:24 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I can think that something like traversing a HOT chain could visit
offsets out of order. But fortunately we prune such collected TIDs
before heap vacuum in the heap case.
Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common sense evaded me a few versions ago...
Right. TidStore is implemented not only for heap, so loading
out-of-order TIDs might be important in the future.
That's what I was probably thinking about some weeks ago, but I'm having a hard time imagining how it would come up, even for something like the conveyor-belt concept.
We have the following WIP comment in test_radixtree:
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
and friends?
Sounds good to me, and the other fixes make sense as well.
Thanks, I merged them.
FYI I've briefly tested the TidStore with blocksize = 32kb, and it
seems to work fine.
That was on my list, so great! How about the other end -- nominally we allow 512b. (In practice it won't matter, but this would make sure I didn't mess anything up when forcing all MaxTuplesPerPage to encode.)
According to the doc, the minimum block size is 1kB. It seems to work
fine with 1kB blocks.
You removed the vacuum integration patch from v27, is there any reason for that?
Just an oversight.
Now for some general comments on the tid store...
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)

Do we need to do anything for this todo?
Since it's practically no problem, I think we can live with it for
now. dshash also has the same todo.
It might help readability to have a concept of "off_upper/off_lower", just so we can describe things more clearly. The key is block + off_upper, and the value is a bitmap of all the off_lower bits. I hinted at that in my addition of encode_key_off(). Along those lines, maybe s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even sure the TIDSTORE_ prefix is valuable for these local macros.
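
As a concrete illustration of that split, here is a small standalone sketch (not part of the patch; it assumes the 8kB-block heap case, where 9 bits cover the offset number and the low 6 bits select a bit in the 64-bit bitmap):

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

#define OFFSET_NBITS 9			/* pg_ceil_log2_32(MaxHeapTuplesPerPage) */
#define VALUE_NBITS  6			/* 2^6 = 64 bits per bitmap word */

int
main(void)
{
	uint32_t	block = 10;
	uint32_t	offset = 70;	/* tid (10,70) */

	uint64_t	tid_i = (uint64_t) offset | ((uint64_t) block << OFFSET_NBITS);
	uint64_t	key = tid_i >> VALUE_NBITS;		/* block + off_upper */
	uint64_t	off_bit = UINT64_C(1) << (offset & ((1U << VALUE_NBITS) - 1));
	uint64_t	blkno = key >> (OFFSET_NBITS - VALUE_NBITS);	/* decode the block back */

	/* prints key=81 off_bit=0x40 block=10; tid (10,71) maps to the same key with bit 7 */
	printf("key=%" PRIu64 " off_bit=0x%" PRIx64 " block=%" PRIu64 "\n", key, off_bit, blkno);
	return 0;
}
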
The word "value" as a variable name is pretty generic in this context, and it might be better to call it the off_lower_bitmap, at least in some places. The "key" doesn't have a good short term for naming, but in comments we should make sure we're clear it's "block# + off_upper".
I'm not a fan of the name "tid_i", even as a temp variable -- maybe "compressed_tid"?
maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/
It might be worth using typedefs for key and value type. Actually, since key type is fixed for the foreseeable future, maybe the radix tree template should define a key typedef?
The term "result" is probably fine within the tidstore, but as a public name used by vacuum, it's not very descriptive. I don't have a good idea, though.
Some files in backend/access use CamelCase for public functions, although it's not consistent. I think doing that for tidstore would help readability, since they would stand out from rt_* functions and vacuum functions. It's a matter of taste, though.
I don't understand the control flow in tidstore_iterate_next(), or when BlockNumberIsValid() is true. If this is the best way to code this, it needs more commentary.
The attached 0008 patch addressed all above comments on tidstore.
Some comments on vacuum:
I think we'd better get some real-world testing of this, fairly soon.
I had an idea: If it's not too much effort, it might be worth splitting it into two parts: one that just adds the store (not caring about its memory limits or progress reporting etc). During index scan, check both the new store and the array and log a warning (we don't want to exit or crash, better to try to investigate while live if possible) if the result doesn't match. Then perhaps set up an instance and let something like TPC-C run for a few days. The second patch would just restore the rest of the current patch. That would help reassure us it's working as designed.
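
A cross-check along those lines could look roughly like the sketch below (hypothetical names: LVDualState and vac_tid_reaped_array() stand in for the existing array-based dead-item state and its bsearch lookup, while tidstore_lookup_tid() is from the patch):

#include "postgres.h"

#include "access/tidstore.h"
#include "storage/itemptr.h"

/* hypothetical state holding both representations during the heap scan */
typedef struct LVDualState
{
	void	   *dead_items_array;	/* old array-based storage */
	TidStore   *dead_items_store;	/* new TidStore */
} LVDualState;

/* hypothetical wrapper around the existing bsearch-based lookup */
extern bool vac_tid_reaped_array(ItemPointer itemptr, void *dead_items_array);

static bool
vac_tid_reaped_checked(ItemPointer itemptr, void *state)
{
	LVDualState *dual = (LVDualState *) state;
	bool		in_array = vac_tid_reaped_array(itemptr, dual->dead_items_array);
	bool		in_store = tidstore_lookup_tid(dual->dead_items_store, itemptr);

	/* warn instead of asserting, so a live instance keeps running */
	if (in_array != in_store)
		elog(WARNING, "dead TID (%u,%u): array lookup %d, tidstore lookup %d",
			 ItemPointerGetBlockNumber(itemptr),
			 ItemPointerGetOffsetNumber(itemptr),
			 (int) in_array, (int) in_store);

	return in_store;
}
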
Yeah, I did a similar thing in an earlier version of tidstore patch.
Since we're trying to introduce two new components: radix tree and
tidstore, I sometimes find it hard to investigate failures happening
during lazy (parallel) vacuum due to a bug either in tidstore or radix
tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
stores tids to both the radix tree and array, and checks if the
results match when lookup or iteration. It will use more memory but it
would not be a big problem in USE_ASSERT_CHECKING builds. It would
also be great if we can enable such checks on some bf animals.
I've tried this idea. Enabling this check on all debug builds (i.e.,
with USE_ASSERT_CHECKING macro) seems not a good idea so I use a
special macro for that, TIDSTORE_DEBUG. I think we can define this
macro on some bf animals (or possibly a new bf animal).
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v29-0011-Debug-TIDStore.patch.txt (text/plain)
From 107aa2af2966c10ce750e6b410ae570462423aab Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 22 Feb 2023 14:43:15 +0900
Subject: [PATCH v29 11/11] Debug TIDStore.
---
src/backend/access/common/tidstore.c | 242 ++++++++++++++++++++++++++-
1 file changed, 238 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 9360520482..438bf0c800 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -28,12 +28,20 @@
#include "postgres.h"
#include "access/tidstore.h"
+#include "catalog/index.h"
#include "miscadmin.h"
#include "port/pg_bitutils.h"
#include "storage/lwlock.h"
#include "utils/dsa.h"
#include "utils/memutils.h"
+#define TIDSTORE_DEBUG
+
+/* Enable TidStore debugging only when USE_ASSERT_CHECKING */
+#if defined(TIDSTORE_DEBUG) && !defined(USE_ASSERT_CHECKING)
+#undef TIDSTORE_DEBUG
+#endif
+
/*
* For encoding purposes, a tid is represented as a pair of 64-bit key and
* 64-bit value.
@@ -115,6 +123,12 @@ typedef struct TidStoreControl
/* handles for TidStore and radix tree */
TidStoreHandle handle;
shared_rt_handle tree_handle;
+
+#ifdef TIDSTORE_DEBUG
+ dsm_handle tids_handle;
+ int64 max_tids;
+ bool tids_unordered;
+#endif
} TidStoreControl;
/* Per-backend state for a TidStore */
@@ -135,6 +149,11 @@ struct TidStore
/* DSA area for TidStore if used */
dsa_area *area;
+
+#ifdef TIDSTORE_DEBUG
+ dsm_segment *tids_seg;
+ ItemPointerData *tids;
+#endif
};
#define TidStoreIsShared(ts) ((ts)->area != NULL)
@@ -157,6 +176,11 @@ typedef struct TidStoreIter
tidkey next_tidkey;
offsetbm next_off_bitmap;
+#ifdef TIDSTORE_DEBUG
+ /* iterator index for the ts->tids array */
+ int64 tids_idx;
+#endif
+
/*
* output for the caller. Must be last because variable-size.
*/
@@ -169,6 +193,15 @@ static inline tidkey encode_blk_off(TidStore *ts, BlockNumber block,
OffsetNumber offset, offsetbm *off_bit);
static inline tidkey encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit);
+/* debug functions available only when TIDSTORE_DEBUG */
+#ifdef TIDSTORE_DEBUG
+static void ts_debug_set_block_offsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+static void ts_debug_iter_check_tids(TidStoreIter *iter);
+static bool ts_debug_is_member(TidStore *ts, ItemPointer tid);
+static int itemptr_cmp(const void *left, const void *right);
+#endif
+
/*
* Create a TidStore. The returned object is allocated in backend-local memory.
* The radix tree for storage is allocated in DSA area is 'area' is non-NULL.
@@ -237,6 +270,26 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
ts->control->upper_off_nbits =
ts->control->max_off_nbits - LOWER_OFFSET_NBITS;
+#ifdef TIDSTORE_DEBUG
+ {
+ int64 max_tids = max_bytes / sizeof(ItemPointerData);
+
+ /* Allocate the array of tids too */
+ if (TidStoreIsShared(ts))
+ {
+ ts->tids_seg = dsm_create(sizeof(ItemPointerData) * max_tids, 0);
+ ts->tids = dsm_segment_address(ts->tids_seg);
+ ts->control->tids_handle = dsm_segment_handle(ts->tids_seg);
+ ts->control->max_tids = max_tids;
+ }
+ else
+ {
+ ts->tids = palloc(sizeof(ItemPointerData) * max_tids);
+ ts->control->max_tids = max_tids;
+ }
+ }
+#endif
+
return ts;
}
@@ -266,6 +319,11 @@ TidStoreAttach(dsa_area *area, TidStoreHandle handle)
ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
ts->area = area;
+#ifdef TIDSTORE_DEBUG
+ ts->tids_seg = dsm_attach(ts->control->tids_handle);
+ ts->tids = (ItemPointer) dsm_segment_address(ts->tids_seg);
+#endif
+
return ts;
}
@@ -280,6 +338,11 @@ TidStoreDetach(TidStore *ts)
{
Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+#ifdef TIDSTORE_DEBUG
+ if (TidStoreIsShared(ts))
+ dsm_detach(ts->tids_seg);
+#endif
+
shared_rt_detach(ts->tree.shared);
pfree(ts);
}
@@ -315,6 +378,13 @@ TidStoreDestroy(TidStore *ts)
local_rt_free(ts->tree.local);
}
+#ifdef TIDSTORE_DEBUG
+ if (TidStoreIsShared(ts))
+ dsm_detach(ts->tids_seg);
+ else
+ pfree(ts->tids);
+#endif
+
pfree(ts);
}
@@ -434,6 +504,11 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
}
}
+#ifdef TIDSTORE_DEBUG
+ /* Insert tids into the tid array too */
+ ts_debug_set_block_offsets(ts, blkno, offsets, num_offsets);
+#endif
+
/* update statistics */
ts->control->num_tids += num_offsets;
@@ -451,6 +526,11 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
offsetbm off_bitmap = 0;
offsetbm off_bit;
bool found;
+ bool ret;
+
+#ifdef TIDSTORE_DEBUG
+ bool ret_debug = ts_debug_is_member(ts, tid);
+#endif
key = encode_tid(ts, tid, &off_bit);
@@ -460,9 +540,20 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
found = local_rt_search(ts->tree.local, key, &off_bitmap);
if (!found)
+ {
+#ifdef TIDSTORE_DEBUG
+ Assert(!ret_debug);
+#endif
return false;
+ }
+
+ ret = (off_bitmap & off_bit) != 0;
- return (off_bitmap & off_bit) != 0;
+#ifdef TIDSTORE_DEBUG
+ Assert(ret == ret_debug);
+#endif
+
+ return ret;
}
/*
@@ -494,6 +585,10 @@ TidStoreBeginIterate(TidStore *ts)
if (TidStoreNumTids(ts) == 0)
iter->finished = true;
+#ifdef TIDSTORE_DEBUG
+ iter->tids_idx = 0;
+#endif
+
return iter;
}
@@ -515,6 +610,7 @@ TidStoreIterResult *
TidStoreIterateNext(TidStoreIter *iter)
{
tidkey key;
+ bool iter_found;
offsetbm off_bitmap = 0;
TidStoreIterResult *output = &(iter->output);
@@ -532,7 +628,7 @@ TidStoreIterateNext(TidStoreIter *iter)
if (iter->next_off_bitmap > 0)
iter_decode_key_off(iter, iter->next_tidkey, iter->next_off_bitmap);
- while (tidstore_iter(iter, &key, &off_bitmap))
+ while ((iter_found = tidstore_iter(iter, &key, &off_bitmap)))
{
BlockNumber blkno = key_get_blkno(iter->ts, key);
@@ -545,14 +641,20 @@ TidStoreIterateNext(TidStoreIter *iter)
*/
iter->next_tidkey = key;
iter->next_off_bitmap = off_bitmap;
- return output;
+ break;
}
/* Collect tids decoded from the key and offset bitmap */
iter_decode_key_off(iter, key, off_bitmap);
}
- iter->finished = true;
+ if (!iter_found)
+ iter->finished = true;
+
+#ifdef TIDSTORE_DEBUG
+ ts_debug_iter_check_tids(iter);
+#endif
+
return output;
}
@@ -699,3 +801,135 @@ encode_blk_off(TidStore *ts, BlockNumber block, OffsetNumber offset,
return key;
}
+
+#ifdef TIDSTORE_DEBUG
+/* Comparator routines for ItemPointer */
+static int
+itemptr_cmp(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+
+/* Insert tids to the tid array for debugging */
+static void
+ts_debug_set_block_offsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ if (ts->control->num_tids > 0 &&
+ blkno < ItemPointerGetBlockNumber(&(ts->tids[ts->control->num_tids - 1])))
+ {
+ /* The array will be sorted at ts_debug_is_member() */
+ ts->control->tids_unordered = true;
+ }
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ ItemPointer tid;
+ int idx = ts->control->num_tids + i;
+
+ /* Enlarge the tid array if necessary */
+ if (idx >= ts->control->max_tids)
+ {
+ ts->control->max_tids *= 2;
+
+ if (TidStoreIsShared(ts))
+ {
+ dsm_segment *new_seg =
+ dsm_create(sizeof(ItemPointerData) * ts->control->max_tids, 0);
+ ItemPointer new_tids = dsm_segment_address(new_seg);
+
+ /* copy tids from old to new array */
+ memcpy(new_tids, ts->tids,
+ sizeof(ItemPointerData) * (ts->control->max_tids / 2));
+
+ dsm_detach(ts->tids_seg);
+ ts->tids = new_tids;
+ }
+ else
+ ts->tids = repalloc(ts->tids,
+ sizeof(ItemPointerData) * ts->control->max_tids);
+ }
+
+ tid = &(ts->tids[idx]);
+ ItemPointerSetBlockNumber(tid, blkno);
+ ItemPointerSetOffsetNumber(tid, offsets[i]);
+ }
+}
+
+/* Return true if the given tid is present in the tid array */
+static bool
+ts_debug_is_member(TidStore *ts, ItemPointer tid)
+{
+ int64 litem,
+ ritem,
+ item;
+ ItemPointer res;
+
+ if (ts->control->num_tids == 0)
+ return false;
+
+ /* Make sure the tid array is sorted */
+ if (ts->control->tids_unordered)
+ {
+ qsort(ts->tids, ts->control->num_tids, sizeof(ItemPointerData), itemptr_cmp);
+ ts->control->tids_unordered = false;
+ }
+
+ litem = itemptr_encode(&ts->tids[0]);
+ ritem = itemptr_encode(&ts->tids[ts->control->num_tids - 1]);
+ item = itemptr_encode(tid);
+
+ /*
+ * Doing a simple bound check before bsearch() is useful to avoid the
+ * extra cost of bsearch(), especially if dead items on the heap are
+ * concentrated in a certain range. Since this function is called for
+ * every index tuple, it pays to be really fast.
+ */
+ if (item < litem || item > ritem)
+ return false;
+
+ res = bsearch(tid, ts->tids, ts->control->num_tids, sizeof(ItemPointerData),
+ itemptr_cmp);
+
+ return (res != NULL);
+}
+
+/* Verify if the iterator output matches the tids in the array for debugging */
+static void
+ts_debug_iter_check_tids(TidStoreIter *iter)
+{
+ BlockNumber blkno = iter->output.blkno;
+
+ for (int i = 0; i < iter->output.num_offsets; i++)
+ {
+ ItemPointer tid = &(iter->ts->tids[iter->tids_idx + i]);
+
+ Assert((iter->tids_idx + i) < iter->ts->control->max_tids);
+ Assert(ItemPointerGetBlockNumber(tid) == blkno);
+ Assert(ItemPointerGetOffsetNumber(tid) == iter->output.offsets[i]);
+ }
+
+ iter->tids_idx += iter->output.num_offsets;
+}
+#endif
--
2.31.1
On Wed, Feb 22, 2023 at 1:16 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Yeah, I did a similar thing in an earlier version of tidstore patch.
Okay, if you had checks against the old array lookup in development, that
gives us better confidence.
Since we're trying to introduce two new components: radix tree and
tidstore, I sometimes find it hard to investigate failures happening
during lazy (parallel) vacuum due to a bug either in tidstore or radix
tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
stores tids to both the radix tree and array, and checks if the
results match when lookup or iteration. It will use more memory but it
would not be a big problem in USE_ASSERT_CHECKING builds. It would
also be great if we can enable such checks on some bf animals.
I've tried this idea. Enabling this check on all debug builds (i.e.,
with USE_ASSERT_CHECKING macro) seems not a good idea so I use a
special macro for that, TIDSTORE_DEBUG. I think we can define this
macro on some bf animals (or possibly a new bf animal).
I don't think any vacuum calls in regression tests would stress any of
this code very much, so it's not worth carrying the old way forward. I was
thinking of only doing this as a short-time sanity check for testing a
real-world workload.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Feb 22, 2023 at 4:35 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Feb 22, 2023 at 1:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Yeah, I did a similar thing in an earlier version of tidstore patch.
Okay, if you had checks against the old array lookup in development, that gives us better confidence.
Since we're trying to introduce two new components: radix tree and
tidstore, I sometimes find it hard to investigate failures happening
during lazy (parallel) vacuum due to a bug either in tidstore or radix
tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
stores tids to both the radix tree and array, and checks if the
results match when lookup or iteration. It will use more memory but it
would not be a big problem in USE_ASSERT_CHECKING builds. It would
also be great if we can enable such checks on some bf animals.I've tried this idea. Enabling this check on all debug builds (i.e.,
with USE_ASSERT_CHECKING macro) seems not a good idea so I use a
special macro for that, TIDSTORE_DEBUG. I think we can define this
macro on some bf animals (or possibly a new bf animal).
I don't think any vacuum calls in regression tests would stress any of this code very much, so it's not worth carrying the old way forward. I was thinking of only doing this as a short-time sanity check for testing a real-world workload.
I guess that it would also be helpful at least until the GA release.
People will be able to test them easily on their workloads or their
custom test scenarios.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Feb 22, 2023 at 3:29 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Feb 22, 2023 at 4:35 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I don't think any vacuum calls in regression tests would stress any of
this code very much, so it's not worth carrying the old way forward. I was
thinking of only doing this as a short-time sanity check for testing a
real-world workload.
I guess that it would also be helpful at least until the GA release.
People will be able to test them easily on their workloads or their
custom test scenarios.
That doesn't seem useful to me. If we've done enough testing to reassure us
the new way always gives the same answer, the old way is not needed at
commit time. If there is any doubt it will always give the same answer,
then the whole patchset won't be committed.
TPC-C was just an example. It should have testing comparing the old and new
methods. If you have already done that to some degree, that might be
enough. After performance tests, I'll also try some vacuums that use the
comparison patch.
--
John Naylor
EDB: http://www.enterprisedb.com
I ran a couple "in situ" tests on server hardware using UUID columns, since
they are common in the real world and have bad correlation to heap
order, so are a challenge for index vacuum.
=== test 1, delete everything from a small table, with very small
maintenance_work_mem:
alter system set shared_buffers ='4GB';
alter system set max_wal_size ='10GB';
alter system set checkpoint_timeout ='30 min';
alter system set autovacuum =off;
-- unrealistically low
alter system set maintenance_work_mem = '32MB';
create table if not exists test (x uuid);
truncate table test;
insert into test (x) select gen_random_uuid() from
generate_series(1,50*1000*1000);
create index on test (x);
delete from test;
vacuum (verbose, truncate off) test;
--
master:
INFO: finished vacuuming "john.naylor.public.test": index scans: 9
system usage: CPU: user: 70.04 s, system: 19.85 s, elapsed: 802.06 s
v29 patch:
INFO: finished vacuuming "john.naylor.public.test": index scans: 1
system usage: CPU: user: 9.80 s, system: 2.62 s, elapsed: 36.68 s
This is a bit artificial, but it's easy to construct cases where the array
leads to multiple index scans but the new tid store can fit everythin
without breaking a sweat. I didn't save the progress reporting, but v29 was
using about 11MB for tid storage.
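
For reference, the 9 index scans on master follow directly from the array sizing: a 32MB dead-tuple array holds only about 5.6 million 6-byte TIDs per round, so 50 million deleted tuples force 9 rounds of index vacuuming (the real limit calculation also subtracts a small struct header, which does not change the result). A quick sketch of that arithmetic:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	const uint64_t m_w_m = UINT64_C(32) * 1024 * 1024;	/* maintenance_work_mem = 32MB */
	const uint64_t tid_size = 6;						/* sizeof(ItemPointerData) */
	const uint64_t dead_tuples = UINT64_C(50) * 1000 * 1000;

	uint64_t	capacity = m_w_m / tid_size;			/* ~5.6 million TIDs per round */
	uint64_t	index_scans = (dead_tuples + capacity - 1) / capacity;

	/* prints "index scans: 9" */
	printf("index scans: %llu\n", (unsigned long long) index_scans);
	return 0;
}
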
=== test 2: try to stress tid lookup with production maintenance_work_mem:
1. use unlogged table to reduce noise
2. vacuum freeze first to reduce heap scan time
3. delete some records at the beginning and end of heap to defeat binary
search's pre-check
alter system set shared_buffers ='4GB';
alter system set max_wal_size ='10GB';
alter system set checkpoint_timeout ='30 min';
alter system set autovacuum =off;
alter system set maintenance_work_mem = '1GB';
create unlogged table if not exists test (x uuid);
truncate table test;
insert into test (x) select gen_random_uuid() from
generate_series(1,1000*1000*1000);
vacuum freeze test;
select pg_size_pretty(pg_table_size('test'));
pg_size_pretty
----------------
41 GB
create index on test (x);
select pg_size_pretty(pg_total_relation_size('test'));
pg_size_pretty
----------------
71 GB
select max(ctid) from test;
max
--------------
(5405405,75)
delete from test where ctid < '(100000,0)'::tid;
delete from test where ctid > '(5300000,0)'::tid;
vacuum (verbose, truncate off) test;
both:
INFO: vacuuming "john.naylor.public.test"
INFO: finished vacuuming "john.naylor.public.test": index scans: 1
index scan needed: 205406 pages from table (3.80% of total) had 38000000
dead item identifiers removed
--
master:
system usage: CPU: user: 134.32 s, system: 19.24 s, elapsed: 286.14 s
v29 patch:
system usage: CPU: user: 97.71 s, system: 45.78 s, elapsed: 573.94 s
The entire vacuum took 25% less wall clock time. Reminder that this is
without wal logging, and also unscientific because only one run.
--
I took 10 seconds of perf data while index vacuuming was going on (showing
calls > 2%):
master:
40.59% postgres postgres [.] vac_cmp_itemptr
24.97% postgres libc-2.17.so [.] bsearch
6.67% postgres postgres [.] btvacuumpage
4.61% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
3.48% postgres postgres [.] PageIndexMultiDelete
2.67% postgres postgres [.] vac_tid_reaped
2.03% postgres postgres [.] compactify_tuples
2.01% postgres libc-2.17.so [.] __memcpy_ssse3_back
v29 patch:
29.22% postgres postgres [.] TidStoreIsMember
9.30% postgres postgres [.] btvacuumpage
7.76% postgres postgres [.] PageIndexMultiDelete
6.31% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
5.60% postgres postgres [.] compactify_tuples
4.26% postgres libc-2.17.so [.] __memcpy_ssse3_back
4.12% postgres postgres [.] hash_search_with_hash_value
--
master:
psql -c "select phase, heap_blks_total, heap_blks_scanned, max_dead_tuples,
num_dead_tuples from pg_stat_progress_vacuum"
phase | heap_blks_total | heap_blks_scanned | max_dead_tuples | num_dead_tuples
-------------------+-----------------+-------------------+-----------------+-----------------
vacuuming indexes | 5405406 | 5405406 | 178956969 | 38000000
v29 patch:
psql -c "select phase, heap_blks_total, heap_blks_scanned,
max_dead_tuple_bytes, dead_tuple_bytes from pg_stat_progress_vacuum"
phase | heap_blks_total | heap_blks_scanned | max_dead_tuple_bytes | dead_tuple_bytes
-------------------+-----------------+-------------------+----------------------+------------------
vacuuming indexes | 5405406 | 5405406 | 1073670144 | 8678064
Here, the old array pessimistically needs 1GB allocated (as for any table >
~5GB), but only fills 228MB for tid lookup. The patch reports 8.7MB. Tables
that only fit, say, 30-50 tuples per page will have less extreme
differences in memory use. Same for the case where only a couple dead items
occur per page, with many uninteresting pages in between. Even so, the
allocation will be much more accurately sized in the patch, especially in
non-parallel vacuum.
There are other cases that could be tested (I mentioned some above), but
this is enough to show the improvements possible.
I still need to do some cosmetic follow-up to v29 as well as a status
report, and I will try to get back to that soon.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Feb 22, 2023 at 6:55 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Feb 22, 2023 at 3:29 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Feb 22, 2023 at 4:35 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I don't think any vacuum calls in regression tests would stress any of this code very much, so it's not worth carrying the old way forward. I was thinking of only doing this as a short-time sanity check for testing a real-world workload.
I guess that it would also be helpful at least until the GA release.
People will be able to test them easily on their workloads or their
custom test scenarios.
That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
True. Even if we've done enough testing, we cannot claim there is no
bug. My idea was to make bug investigation easier, but on
reflection it seems not the best fit for this purpose. Instead, it
seems better to add more of the necessary assertions. What do you think
about the attached patch? Please note that it also includes the
changes for the minimum memory requirement.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
add_assertions.patch.txt (text/plain)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 9360520482..fc20e58a95 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -75,6 +75,14 @@ typedef uint64 offsetbm;
#define LOWER_OFFSET_NBITS 6 /* log(sizeof(offsetbm), 2) */
#define LOWER_OFFSET_MASK ((1 << LOWER_OFFSET_NBITS) - 1)
+/*
+ * The minimum amount of memory required by TidStore is 2MB, the current minimum
+ * valid value for the maintenance_work_mem GUC. This is required to allocate the
+ * DSA initial segment (1MB) and some metadata. For simplicity, this number is
+ * also applied to the local TidStore cases.
+ */
+#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
+
/* A magic value used to identify our TidStore. */
#define TIDSTORE_MAGIC 0x826f6a10
@@ -101,7 +109,7 @@ typedef struct TidStoreControl
/* These values are never changed after creation */
size_t max_bytes; /* the maximum bytes a TidStore can use */
- int max_off; /* the maximum offset number */
+ OffsetNumber max_off; /* the maximum offset number */
int max_off_nbits; /* the number of bits required for offset
* numbers */
int upper_off_nbits; /* the number of bits of offset numbers
@@ -174,10 +182,17 @@ static inline tidkey encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit
* The radix tree for storage is allocated in DSA area is 'area' is non-NULL.
*/
TidStore *
-TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
+TidStoreCreate(size_t max_bytes, OffsetNumber max_off, dsa_area *area)
{
TidStore *ts;
+ Assert(max_off <= MaxOffsetNumber);
+
+ /* Sanity check for the max_bytes */
+ if (max_bytes < TIDSTORE_MIN_MEMORY)
+ elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
+ TIDSTORE_MIN_MEMORY, max_bytes);
+
ts = palloc0(sizeof(TidStore));
/*
@@ -192,8 +207,8 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
* In local TidStore cases, the radix tree uses slab allocators for each kind
* of node class. The most memory consuming case while adding Tids associated
* with one page (i.e. during TidStoreSetBlockOffsets()) is that we allocate a new
- * slab block for a new radix tree node, which is approximately 70kB. Therefore,
- * we deduct 70kB from the max_bytes.
+ * slab block for a new radix tree node, which is approximately 70kB at most.
+ * Therefore, we deduct 70kB from the max_bytes.
*
* In shared cases, DSA allocates the memory segments big enough to follow
* a geometric series that approximately doubles the total DSA size (see
@@ -378,6 +393,7 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
const int nkeys = UINT64CONST(1) << ts->control->upper_off_nbits;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+ Assert(BlockNumberIsValid(blkno));
bitmaps = palloc(sizeof(offsetbm) * nkeys);
key = prev_key = key_base;
@@ -386,6 +402,8 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
{
offsetbm off_bit;
+ Assert(offsets[i] <= ts->control->max_off);
+
/* encode the tid to a key and partial offset */
key = encode_blk_off(ts, blkno, offsets[i], &off_bit);
@@ -452,6 +470,8 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
offsetbm off_bit;
bool found;
+ Assert(ItemPointerIsValid(tid));
+
key = encode_tid(ts, tid, &off_bit);
if (TidStoreIsShared(ts))
@@ -535,6 +555,7 @@ TidStoreIterateNext(TidStoreIter *iter)
while (tidstore_iter(iter, &key, &off_bitmap))
{
BlockNumber blkno = key_get_blkno(iter->ts, key);
+ Assert(BlockNumberIsValid(blkno));
if (BlockNumberIsValid(output->blkno) && output->blkno != blkno)
{
@@ -586,6 +607,7 @@ TidStoreNumTids(TidStore *ts)
num_tids = ts->control->num_tids;
LWLockRelease(&ts->control->lock);
+ Assert(num_tids >= 0);
return num_tids;
}
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index 66f0fdd482..d1cc93cbb6 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -30,7 +30,7 @@ typedef struct TidStoreIterResult
OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
} TidStoreIterResult;
-extern TidStore *TidStoreCreate(size_t max_bytes, int max_off, dsa_area *dsa);
+extern TidStore *TidStoreCreate(size_t max_bytes, OffsetNumber max_off, dsa_area *dsa);
extern TidStore *TidStoreAttach(dsa_area *dsa, dsa_pointer handle);
extern void TidStoreDetach(TidStore *ts);
extern void TidStoreDestroy(TidStore *ts);
On Thu, Feb 23, 2023 at 6:41 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I ran a couple "in situ" tests on server hardware using UUID columns, since they are common in the real world and have bad correlation to heap order, so are a challenge for index vacuum.
Thank you for the test!
=== test 1, delete everything from a small table, with very small maintenance_work_mem:
alter system set shared_buffers ='4GB';
alter system set max_wal_size ='10GB';
alter system set checkpoint_timeout ='30 min';
alter system set autovacuum =off;
-- unrealistically low
alter system set maintenance_work_mem = '32MB';
create table if not exists test (x uuid);
truncate table test;
insert into test (x) select gen_random_uuid() from generate_series(1,50*1000*1000);
create index on test (x);
delete from test;
vacuum (verbose, truncate off) test;
--master:
INFO: finished vacuuming "john.naylor.public.test": index scans: 9
system usage: CPU: user: 70.04 s, system: 19.85 s, elapsed: 802.06 s

v29 patch:
INFO: finished vacuuming "john.naylor.public.test": index scans: 1
system usage: CPU: user: 9.80 s, system: 2.62 s, elapsed: 36.68 s

This is a bit artificial, but it's easy to construct cases where the array leads to multiple index scans but the new tid store can fit everything without breaking a sweat. I didn't save the progress reporting, but v29 was using about 11MB for tid storage.
Cool.
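As a rough cross-check on the "index scans: 9" figure above: with the old array at 6 bytes per TID, a 32MB maintenance_work_mem holds about 5.6 million TIDs per index-vacuum cycle, and 50 million deleted tuples therefore need about nine cycles. A standalone sketch of that arithmetic (not PostgreSQL source; the real limit calculation has some additional adjustments):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Rough sketch of why test 1 above needs about nine index scans on
 * master: every dead TID costs 6 bytes in the old array, and
 * maintenance_work_mem = 32MB bounds how many fit per cycle.
 */
int
main(void)
{
	const uint64_t dead_tuples = UINT64_C(50) * 1000 * 1000;	/* all rows deleted */
	const uint64_t tid_size = 6;								/* sizeof(ItemPointerData) */
	const uint64_t m_w_m = UINT64_C(32) * 1024 * 1024;			/* maintenance_work_mem */

	uint64_t	tids_per_cycle = m_w_m / tid_size;				/* ~5.59 million */
	uint64_t	cycles = (dead_tuples + tids_per_cycle - 1) / tids_per_cycle;

	printf("%" PRIu64 " TIDs per cycle -> about %" PRIu64 " index scans\n",
		   tids_per_cycle, cycles);		/* prints 9 */
	return 0;
}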
=== test 2: try to stress tid lookup with production maintenance_work_mem:
1. use unlogged table to reduce noise
2. vacuum freeze first to reduce heap scan time
3. delete some records at the beginning and end of heap to defeat binary search's pre-check

alter system set shared_buffers ='4GB';
alter system set max_wal_size ='10GB';
alter system set checkpoint_timeout ='30 min';
alter system set autovacuum =off;
alter system set maintenance_work_mem = '1GB';
create unlogged table if not exists test (x uuid);
truncate table test;
insert into test (x) select gen_random_uuid() from generate_series(1,1000*1000*1000);
vacuum freeze test;
select pg_size_pretty(pg_table_size('test'));
pg_size_pretty
----------------
41 GB

create index on test (x);
select pg_size_pretty(pg_total_relation_size('test'));
pg_size_pretty
----------------
71 GB

select max(ctid) from test;
max
--------------
(5405405,75)

delete from test where ctid < '(100000,0)'::tid;
delete from test where ctid > '(5300000,0)'::tid;
vacuum (verbose, truncate off) test;
both:
INFO: vacuuming "john.naylor.public.test"
INFO: finished vacuuming "john.naylor.public.test": index scans: 1
index scan needed: 205406 pages from table (3.80% of total) had 38000000 dead item identifiers removed
--
master:
system usage: CPU: user: 134.32 s, system: 19.24 s, elapsed: 286.14 s
v29 patch:
system usage: CPU: user: 97.71 s, system: 45.78 s, elapsed: 573.94 s
In v29 vacuum took twice as long (286 s vs. 573 s)?
The entire vacuum took 25% less wall clock time. Reminder that this is without wal logging, and also unscientific because only one run.
--
I took 10 seconds of perf data while index vacuuming was going on (showing calls > 2%):

master:
40.59% postgres postgres [.] vac_cmp_itemptr
24.97% postgres libc-2.17.so [.] bsearch
6.67% postgres postgres [.] btvacuumpage
4.61% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
3.48% postgres postgres [.] PageIndexMultiDelete
2.67% postgres postgres [.] vac_tid_reaped
2.03% postgres postgres [.] compactify_tuples
2.01% postgres libc-2.17.so [.] __memcpy_ssse3_back

v29 patch:
29.22% postgres postgres [.] TidStoreIsMember
9.30% postgres postgres [.] btvacuumpage
7.76% postgres postgres [.] PageIndexMultiDelete
6.31% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
5.60% postgres postgres [.] compactify_tuples
4.26% postgres libc-2.17.so [.] __memcpy_ssse3_back
4.12% postgres postgres [.] hash_search_with_hash_value
--
master:
psql -c "select phase, heap_blks_total, heap_blks_scanned, max_dead_tuples, num_dead_tuples from pg_stat_progress_vacuum"
       phase       | heap_blks_total | heap_blks_scanned | max_dead_tuples | num_dead_tuples
-------------------+-----------------+-------------------+-----------------+-----------------
 vacuuming indexes |         5405406 |           5405406 |       178956969 |        38000000

v29 patch:
psql -c "select phase, heap_blks_total, heap_blks_scanned, max_dead_tuple_bytes, dead_tuple_bytes from pg_stat_progress_vacuum"
       phase       | heap_blks_total | heap_blks_scanned | max_dead_tuple_bytes | dead_tuple_bytes
-------------------+-----------------+-------------------+----------------------+------------------
 vacuuming indexes |         5405406 |           5405406 |           1073670144 |          8678064

Here, the old array pessimistically needs 1GB allocated (as for any table > ~5GB), but only fills 228MB for tid lookup. The patch reports 8.7MB. Tables that only fit, say, 30-50 tuples per page will have less extreme differences in memory use. Same for the case where only a couple dead items occur per page, with many uninteresting pages in between. Even so, the allocation will be much more accurately sized in the patch, especially in non-parallel vacuum.
Agreed.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Feb 24, 2023 at 3:41 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
In v29 vacuum took twice as long (286 s vs. 573 s)?
Not sure what happened there, and clearly I was looking at the wrong number
:/
I scripted the test for reproducibility and ran it three times. Also
included some variations (attached):
UUID times look comparable here, so no speedup or regression:
master:
system usage: CPU: user: 216.05 s, system: 35.81 s, elapsed: 634.22 s
system usage: CPU: user: 173.71 s, system: 31.24 s, elapsed: 599.04 s
system usage: CPU: user: 171.16 s, system: 30.21 s, elapsed: 583.21 s
v29:
system usage: CPU: user: 93.47 s, system: 40.92 s, elapsed: 594.10 s
system usage: CPU: user: 99.58 s, system: 44.73 s, elapsed: 606.80 s
system usage: CPU: user: 96.29 s, system: 42.74 s, elapsed: 600.10 s
Then, I tried sequential integers, which is a much more favorable access
pattern in general, and the new tid storage shows substantial improvement:
master:
system usage: CPU: user: 100.39 s, system: 7.79 s, elapsed: 121.57 s
system usage: CPU: user: 104.90 s, system: 8.81 s, elapsed: 124.24 s
system usage: CPU: user: 95.04 s, system: 7.55 s, elapsed: 116.44 s
v29:
system usage: CPU: user: 24.57 s, system: 8.53 s, elapsed: 61.07 s
system usage: CPU: user: 23.18 s, system: 8.25 s, elapsed: 58.99 s
system usage: CPU: user: 23.20 s, system: 8.98 s, elapsed: 66.86 s
That's fast enough that I thought an improvement would show up even with
standard WAL logging (no separate attachment, since it's a trivial change).
Seems a bit faster:
master:
system usage: CPU: user: 152.27 s, system: 11.76 s, elapsed: 216.86 s
system usage: CPU: user: 137.25 s, system: 11.07 s, elapsed: 213.62 s
system usage: CPU: user: 149.48 s, system: 12.15 s, elapsed: 220.96 s
v29:
system usage: CPU: user: 40.88 s, system: 15.99 s, elapsed: 170.98 s
system usage: CPU: user: 41.33 s, system: 15.45 s, elapsed: 166.75 s
system usage: CPU: user: 41.51 s, system: 18.20 s, elapsed: 203.94 s
There is more we could test here, but I feel better about these numbers.
In the next few days, I'll resume style review and list the remaining
issues we need to address.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Feb 22, 2023 at 6:55 PM John Naylor
<john.naylor@enterprisedb.com> wrote:That doesn't seem useful to me. If we've done enough testing to
reassure us the new way always gives the same answer, the old way is not
needed at commit time. If there is any doubt it will always give the same
answer, then the whole patchset won't be committed.
My idea is to make the bug investigation easier but on
reflection, it seems not the best idea given this purpose.
My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old
tid array. As I've said, that doesn't seem like a good thing to carry
forward forevermore, in any form. Plus, comparing new code with new code is
not the same thing as comparing existing code with new code. That was my
idea upthread.
Maybe the effort my idea requires is too much vs. the likelihood of finding
a problem. In any case, it's clear that if I want that level of paranoia,
I'm going to have to do it myself.
What do you think
about the attached patch? Please note that it also includes the
changes for minimum memory requirement.
Most of the asserts look logical, or at least harmless.
- int max_off; /* the maximum offset number */
+ OffsetNumber max_off; /* the maximum offset number */
I agree with using the specific type for offsets here, but I'm not sure why
this change belongs in this patch. If we decided against the new asserts,
this would be easy to lose.
This change, however, defies common sense:
+/*
+ * The minimum amount of memory required by TidStore is 2MB, the current
minimum
+ * valid value for the maintenance_work_mem GUC. This is required to
allocate the
+ * DSA initial segment, 1MB, and some meta data. This number is applied
also to
+ * the local TidStore cases for simplicity.
+ */
+#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */
+ /* Sanity check for the max_bytes */
+ if (max_bytes < TIDSTORE_MIN_MEMORY)
+ elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided",
+ TIDSTORE_MIN_MEMORY, max_bytes);
Aside from the fact that this elog's something that would never get past
development, the #define just adds a hard-coded copy of something that is
already hard-coded somewhere else, whose size depends on an implementation
detail in a third place.
This also assumes that all users of tid store are limited by
maintenance_work_mem. Andres thought of an example of some day unifying
with tidbitmap.c, and maybe other applications will be limited by work_mem.
But now that I'm looking at the guc tables, I am reminded that work_mem's
minimum is 64kB, so this highlights a design problem: There is obviously no
requirement that the minimum work_mem has to be >= a single DSA segment,
even though operations like parallel hash and parallel bitmap heap scan are
limited by work_mem. It would be nice to find out what happens with these
parallel features when work_mem is tiny (maybe parallelism is not even
considered?).
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Feb 28, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Feb 22, 2023 at 6:55 PM John Naylor
<john.naylor@enterprisedb.com> wrote:That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
My idea is to make the bug investigation easier but on
reflection, it seems not the best idea given this purpose.My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread.
Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself.
What do you think
about the attached patch? Please note that it also includes the
changes for minimum memory requirement.Most of the asserts look logical, or at least harmless.
- int max_off; /* the maximum offset number */ + OffsetNumber max_off; /* the maximum offset number */I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.
Right. I'll separate this change as a separate patch.
This change, however, defies common sense:
+/* + * The minimum amount of memory required by TidStore is 2MB, the current minimum + * valid value for the maintenance_work_mem GUC. This is required to allocate the + * DSA initial segment, 1MB, and some meta data. This number is applied also to + * the local TidStore cases for simplicity. + */ +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */+ /* Sanity check for the max_bytes */ + if (max_bytes < TIDSTORE_MIN_MEMORY) + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided", + TIDSTORE_MIN_MEMORY, max_bytes);Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place.
This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of some day unifying with tidbitmap.c, and maybe other applications will be limited by work_mem.
But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem.
Right.
It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?).
IIUC both don't care about the allocated DSA segment size. Parallel
hash accounts actual tuple (+ header) size as used memory but doesn't
consider how much DSA segment is allocated behind. Both parallel hash
and parallel bitmap scan can work even with work_mem = 64kB, but when
checking the total DSA segment size allocated during these operations,
it was 1MB.
I realized that there is a similar memory limit design issue also on
the non-shared tidstore cases. We deduct 70kB from max_bytes but it
won't work fine with work_mem = 64kB. Probably we need to reconsider
it. FYI 70kB comes from the maximum slab block size for node256.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 28, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Feb 22, 2023 at 6:55 PM John Naylor
<john.naylor@enterprisedb.com> wrote:That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
My idea is to make the bug investigation easier but on
reflection, it seems not the best idea given this purpose.My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread.
Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself.
What do you think
about the attached patch? Please note that it also includes the
changes for minimum memory requirement.Most of the asserts look logical, or at least harmless.
- int max_off; /* the maximum offset number */ + OffsetNumber max_off; /* the maximum offset number */I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.
Right. I'll separate this change as a separate patch.
This change, however, defies common sense:
+/* + * The minimum amount of memory required by TidStore is 2MB, the current minimum + * valid value for the maintenance_work_mem GUC. This is required to allocate the + * DSA initial segment, 1MB, and some meta data. This number is applied also to + * the local TidStore cases for simplicity. + */ +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */+ /* Sanity check for the max_bytes */ + if (max_bytes < TIDSTORE_MIN_MEMORY) + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided", + TIDSTORE_MIN_MEMORY, max_bytes);Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place.
This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of some day unifying with tidbitmap.c, and maybe other applications will be limited by work_mem.
But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem.
Right.
It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?).
IIUC both don't care about the allocated DSA segment size. Parallel
hash accounts actual tuple (+ header) size as used memory but doesn't
consider how much DSA segment is allocated behind. Both parallel hash
and parallel bitmap scan can work even with work_mem = 64kB, but when
checking the total DSA segment size allocated during these operations,
it was 1MB.I realized that there is a similar memory limit design issue also on
the non-shared tidstore cases. We deduct 70kB from max_bytes but it
won't work fine with work_mem = 64kB. Probably we need to reconsider
it. FYI 70kB comes from the maximum slab block size for node256.
Currently, we calculate the slab block size enough to allocate 32
chunks from there. For node256, the leaf node is 2,088 bytes and the
slab block size is 66,816 bytes. One idea to fix this issue is to
decrease it. For example, with 16 chunks the slab block size is 33,408
bytes and with 8 chunks it's 16,704 bytes. I ran a brief benchmark
test with 70kB block size and 16kB block size:
* 70kB slab blocks:
select * from bench_search_random_nodes(20 * 1000 * 1000, '0xFFFFFF');
height = 2, n3 = 0, n15 = 0, n32 = 0, n125 = 0, n256 = 65793
mem_allocated | load_ms | search_ms
---------------+---------+-----------
143085184 | 1216 | 750
(1 row)
* 16kB slab blocks:
select * from bench_search_random_nodes(20 * 1000 * 1000, '0xFFFFFF');
height = 2, n3 = 0, n15 = 0, n32 = 0, n125 = 0, n256 = 65793
mem_allocated | load_ms | search_ms
---------------+---------+-----------
157601248 | 1220 | 786
(1 row)
There is a bit of a performance difference, but a smaller slab block
size seems acceptable if there is no better way.
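To make the block-size numbers above concrete, here is a standalone sketch of the arithmetic (the 2,088-byte node256 leaf size and the chunks-per-block counts are the figures quoted above; this is not the radix tree code itself):

#include <stddef.h>
#include <stdio.h>

/*
 * Standalone sketch of the slab block-size arithmetic above: the slab
 * block is sized to hold a fixed number of chunks, so shrinking the
 * chunk count shrinks the largest block the allocator will request.
 * The 2,088-byte figure is the node256 leaf size quoted in the thread.
 */
int
main(void)
{
	const size_t node256_leaf_size = 2088;
	const int	chunks_per_block[] = {32, 16, 8};

	for (int i = 0; i < 3; i++)
		printf("%2d chunks per block -> %zu-byte slab blocks\n",
			   chunks_per_block[i],
			   node256_leaf_size * (size_t) chunks_per_block[i]);

	/* prints 66816, 33408 and 16704 bytes, matching the figures above */
	return 0;
}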
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Feb 28, 2023 at 10:09 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Tue, Feb 28, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <
sawada.mshk@gmail.com> wrote:
On Wed, Feb 22, 2023 at 6:55 PM John Naylor
<john.naylor@enterprisedb.com> wrote:That doesn't seem useful to me. If we've done enough testing to
reassure us the new way always gives the same answer, the old way is not
needed at commit time. If there is any doubt it will always give the same
answer, then the whole patchset won't be committed.
My idea is to make the bug investigation easier but on
reflection, it seems not the best idea given this purpose.My concern with TIDSTORE_DEBUG is that it adds new code that mimics
the old tid array. As I've said, that doesn't seem like a good thing to
carry forward forevermore, in any form. Plus, comparing new code with new
code is not the same thing as comparing existing code with new code. That
was my idea upthread.
Maybe the effort my idea requires is too much vs. the likelihood of
finding a problem. In any case, it's clear that if I want that level of
paranoia, I'm going to have to do it myself.
What do you think
about the attached patch? Please note that it also includes the
changes for minimum memory requirement.Most of the asserts look logical, or at least harmless.
- int max_off; /* the maximum offset number */ + OffsetNumber max_off; /* the maximum offset number */I agree with using the specific type for offsets here, but I'm not
sure why this change belongs in this patch. If we decided against the new
asserts, this would be easy to lose.
Right. I'll separate this change as a separate patch.
This change, however, defies common sense:
+/* + * The minimum amount of memory required by TidStore is 2MB, the
current minimum
+ * valid value for the maintenance_work_mem GUC. This is required to
allocate the
+ * DSA initial segment, 1MB, and some meta data. This number is
applied also to
+ * the local TidStore cases for simplicity. + */ +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */+ /* Sanity check for the max_bytes */ + if (max_bytes < TIDSTORE_MIN_MEMORY) + elog(ERROR, "memory for TidStore must be at least %ld, but %zu
provided",
+ TIDSTORE_MIN_MEMORY, max_bytes);
Aside from the fact that this elog's something that would never get
past development, the #define just adds a hard-coded copy of something that
is already hard-coded somewhere else, whose size depends on an
implementation detail in a third place.
This also assumes that all users of tid store are limited by
maintenance_work_mem. Andres thought of an example of some day unifying
with tidbitmap.c, and maybe other applications will be limited by work_mem.
But now that I'm looking at the guc tables, I am reminded that
work_mem's minimum is 64kB, so this highlights a design problem: There is
obviously no requirement that the minimum work_mem has to be >= a single
DSA segment, even though operations like parallel hash and parallel bitmap
heap scan are limited by work_mem.
Right.
It would be nice to find out what happens with these parallel
features when work_mem is tiny (maybe parallelism is not even considered?).
IIUC both don't care about the allocated DSA segment size. Parallel
hash accounts actual tuple (+ header) size as used memory but doesn't
consider how much DSA segment is allocated behind. Both parallel hash
and parallel bitmap scan can work even with work_mem = 64kB, but when
checking the total DSA segment size allocated during these operations,
it was 1MB.I realized that there is a similar memory limit design issue also on
the non-shared tidstore cases. We deduct 70kB from max_bytes but it
won't work fine with work_mem = 64kB. Probably we need to reconsider
it. FYI 70kB comes from the maximum slab block size for node256.Currently, we calculate the slab block size enough to allocate 32
chunks from there. For node256, the leaf node is 2,088 bytes and the
slab block size is 66,816 bytes. One idea to fix this issue to
decrease it.
I think we're trying to solve the wrong problem here. I need to study this
more, but it seems that code that needs to stay within a memory limit only
needs to track what's been allocated in chunks within a block, since
writing there is what invokes a page fault. If we're not keeping track of
each and every chunk space, for speed, it doesn't follow that we need to
keep every block allocation within the configured limit. I'm guessing we
can just ask the context if the block space has gone *over* the limit, and
we can assume that the last allocation we perform will only fault one
additional page. We need to have a clear answer on this before doing
anything else.
If that's correct, and I'm not positive yet, we can get rid of all the
fragile assumptions about things the tid store has no business knowing
about, as well as the guc change. I'm not sure how this affects progress
reporting, because it would be nice if it didn't report dead_tuple_bytes
bigger than max_dead_tuple_bytes.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Tue, Feb 28, 2023 at 10:09 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 28, 2023 at 10:20 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Feb 28, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Fri, Feb 24, 2023 at 12:50 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Feb 22, 2023 at 6:55 PM John Naylor
<john.naylor@enterprisedb.com> wrote:That doesn't seem useful to me. If we've done enough testing to reassure us the new way always gives the same answer, the old way is not needed at commit time. If there is any doubt it will always give the same answer, then the whole patchset won't be committed.
My idea is to make the bug investigation easier but on
reflection, it seems not the best idea given this purpose.My concern with TIDSTORE_DEBUG is that it adds new code that mimics the old tid array. As I've said, that doesn't seem like a good thing to carry forward forevermore, in any form. Plus, comparing new code with new code is not the same thing as comparing existing code with new code. That was my idea upthread.
Maybe the effort my idea requires is too much vs. the likelihood of finding a problem. In any case, it's clear that if I want that level of paranoia, I'm going to have to do it myself.
What do you think
about the attached patch? Please note that it also includes the
changes for minimum memory requirement.Most of the asserts look logical, or at least harmless.
- int max_off; /* the maximum offset number */ + OffsetNumber max_off; /* the maximum offset number */I agree with using the specific type for offsets here, but I'm not sure why this change belongs in this patch. If we decided against the new asserts, this would be easy to lose.
Right. I'll separate this change as a separate patch.
This change, however, defies common sense:
+/* + * The minimum amount of memory required by TidStore is 2MB, the current minimum + * valid value for the maintenance_work_mem GUC. This is required to allocate the + * DSA initial segment, 1MB, and some meta data. This number is applied also to + * the local TidStore cases for simplicity. + */ +#define TIDSTORE_MIN_MEMORY (2 * 1024 * 1024L) /* 2MB */+ /* Sanity check for the max_bytes */ + if (max_bytes < TIDSTORE_MIN_MEMORY) + elog(ERROR, "memory for TidStore must be at least %ld, but %zu provided", + TIDSTORE_MIN_MEMORY, max_bytes);Aside from the fact that this elog's something that would never get past development, the #define just adds a hard-coded copy of something that is already hard-coded somewhere else, whose size depends on an implementation detail in a third place.
This also assumes that all users of tid store are limited by maintenance_work_mem. Andres thought of an example of some day unifying with tidbitmap.c, and maybe other applications will be limited by work_mem.
But now that I'm looking at the guc tables, I am reminded that work_mem's minimum is 64kB, so this highlights a design problem: There is obviously no requirement that the minimum work_mem has to be >= a single DSA segment, even though operations like parallel hash and parallel bitmap heap scan are limited by work_mem.
Right.
It would be nice to find out what happens with these parallel features when work_mem is tiny (maybe parallelism is not even considered?).
IIUC both don't care about the allocated DSA segment size. Parallel
hash accounts actual tuple (+ header) size as used memory but doesn't
consider how much DSA segment is allocated behind. Both parallel hash
and parallel bitmap scan can work even with work_mem = 64kB, but when
checking the total DSA segment size allocated during these operations,
it was 1MB.I realized that there is a similar memory limit design issue also on
the non-shared tidstore cases. We deduct 70kB from max_bytes but it
won't work fine with work_mem = 64kB. Probably we need to reconsider
it. FYI 70kB comes from the maximum slab block size for node256.Currently, we calculate the slab block size enough to allocate 32
chunks from there. For node256, the leaf node is 2,088 bytes and the
slab block size is 66,816 bytes. One idea to fix this issue to
decrease it.I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that needs to stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing there is what invokes a page fault.
Right. I guess we've discussed what we use for calculating the *used*
memory amount but I don't remember.
I think I was confused by the fact that we use some different
approaches to calculate the amount of used memory. Parallel hash and
tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
allocated block size.
If we're not keeping track of each and every chunk space, for speed, it doesn't follow that we need to keep every block allocation within the configured limit. I'm guessing we can just ask the context if the block space has gone *over* the limit, and we can assume that the last allocation we perform will only fault one additional page. We need to have a clear answer on this before doing anything else.
If that's correct, and I'm not positive yet, we can get rid of all the fragile assumptions about things the tid store has no business knowing about, as well as the guc change.
True.
I'm not sure how this affects progress reporting, because it would be nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.
Yes, the progress reporting could be confusing. Particularly, in
shared tidstore cases, the dead_tuple_bytes could be much bigger than
max_dead_tuple_bytes. Probably what we need are functions for
MemoryContext and dsa_area to get the amount of memory that has been
allocated, without tracking every chunk space. For example, the
functions would be like what SlabStats() does: iterate over every
block and calculate the total/free memory usage.
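To illustrate how far the two accounting styles can diverge, here is a toy comparison of chunk-level accounting versus block-level accounting for a slab-like allocator, using the node256 figures quoted earlier in the thread (2,088-byte chunks, 32 chunks per block). It is an illustration only, not the slab allocator itself:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Toy comparison of the two accounting styles discussed in this thread:
 * counting the chunks actually handed out versus counting whole
 * allocator blocks (what MemoryContextMemAllocated() reflects).
 */
int
main(void)
{
	const uint64_t chunk_size = 2088;
	const uint64_t chunks_per_block = 32;
	const uint64_t block_size = chunk_size * chunks_per_block;	/* 66,816 bytes */

	for (uint64_t nchunks = 1; nchunks <= 64; nchunks *= 2)
	{
		uint64_t	nblocks = (nchunks + chunks_per_block - 1) / chunks_per_block;

		printf("%3" PRIu64 " chunks in use: chunk-level %7" PRIu64 " bytes, block-level %7" PRIu64 " bytes\n",
			   nchunks, nchunks * chunk_size, nblocks * block_size);
	}

	/* A single chunk in use already reports 66,816 bytes at the block level. */
	return 0;
}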
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 1, 2023 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
I think we're trying to solve the wrong problem here. I need to study
this more, but it seems that code that needs to stay within a memory limit
only needs to track what's been allocated in chunks within a block, since
writing there is what invokes a page fault.
Right. I guess we've discussed what we use for calculating the *used*
memory amount but I don't remember.I think I was confused by the fact that we use some different
approaches to calculate the amount of used memory. Parallel hash and
tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
allocated block size.
That's good to know. The latter says:
* After adding a new group to the hash table, check whether we need to
enter
* spill mode. Allocations may happen without adding new groups (for
instance,
* if the transition state size grows), so this check is imperfect.
I'm willing to claim that vacuum can be imperfect also, given the tid
store's properties: 1) on average much more efficient in used space, and 2)
no longer bound by the 1GB limit.
I'm not sure how this affects progress reporting, because it would be
nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.
Yes, the progress reporting could be confusable. Particularly, in
shared tidstore cases, the dead_tuple_bytes could be much bigger than
max_dead_tuple_bytes. Probably what we need might be functions for
MemoryContext and dsa_area to get the amount of memory that has been
allocated, by not tracking every chunk space. For example, the
functions would be like what SlabStats() does; iterate over every
block and calculates the total/free memory usage.
I'm not sure we need to invent new infrastructure for this. Looking at v29
in vacuumlazy.c, the order of operations for memory accounting is:
First, get the block-level space -- stop and vacuum indexes if we exceed
the limit:
/*
* Consider if we definitely have enough space to process TIDs on page
* already. If we are close to overrunning the available space for
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
if (TidStoreIsFull(vacrel->dead_items)) --> which is basically "if
(TidStoreMemoryUsage(ts) > ts->control->max_bytes)"
Then, after pruning the current page, store the tids and then get the
block-level space again:
else if (prunestate.num_offsets > 0)
{
/* Save details of the LP_DEAD items from the page in dead_items */
TidStoreSetBlockOffsets(...);
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
TidStoreMemoryUsage(dead_items));
}
Since the block-level measurement is likely overestimating quite a bit, I
propose to simply reverse the order of the actions here, effectively
reporting progress for the *last page* and not the current one: First
update progress with the current memory usage, then add tids for this page.
If this allocated a new block, only a small bit of that will be written to.
If this block pushes it over the limit, we will detect that up at the top
of the loop. It's kind of like our earlier attempts at a "fudge factor",
but simpler and less brittle. And, as far as OS pages we have actually
written to, I think it'll effectively respect the memory limit, at least in
the local mem case. And the numbers will make sense.
Thoughts?
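Spelled out as a self-contained mock-up, the proposed ordering would behave roughly like this. The TidStore functions below are simplified stand-ins (argument lists trimmed, numbers made up) for the interfaces quoted above, not the actual patch:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Mock-up of the reordered per-page bookkeeping proposed above. */
static uint64_t store_bytes;			/* block-level usage of the store */
static uint64_t peak_reported;			/* highest figure ever reported */
static const uint64_t max_bytes = 4096;	/* stand-in memory limit */

static bool TidStoreIsFull(void) { return store_bytes > max_bytes; }
static uint64_t TidStoreMemoryUsage(void) { return store_bytes; }
static void TidStoreSetBlockOffsets(void) { store_bytes += 512; }	/* pretend growth */
static void vacuum_indexes(void) { store_bytes = 0; }

static void
report_progress(uint64_t n)
{
	if (n > peak_reported)
		peak_reported = n;
}

int
main(void)
{
	for (int blkno = 0; blkno < 100; blkno++)
	{
		/* First check the usage left over from the pages already stored. */
		if (TidStoreIsFull())
			vacuum_indexes();

		/* ... prune the page, collect its dead offsets ... */

		/* Report usage as it stood before this page, then store its TIDs. */
		report_progress(TidStoreMemoryUsage());
		TidStoreSetBlockOffsets();
	}

	printf("peak reported usage %llu bytes, limit %llu bytes\n",
		   (unsigned long long) peak_reported,
		   (unsigned long long) max_bytes);
	return 0;
}

With this ordering, the reported figure tops out at the configured limit, because usage is sampled before the current page's TIDs are added and any overshoot is caught at the top of the next iteration.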
But now that I'm looking more closely at the details of memory accounting,
I don't like that TidStoreMemoryUsage() is called twice per page pruned
(see above). Maybe it wouldn't noticeably slow things down, but it's a bit
sloppy. It seems like we should call it once per loop and save the result
somewhere. If that's the right way to go, that possibly indicates that
TidStoreIsFull() is not a useful interface, at least in this form.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Mar 3, 2023 at 8:04 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Wed, Mar 1, 2023 at 6:59 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Wed, Mar 1, 2023 at 3:37 PM John Naylor <john.naylor@enterprisedb.com> wrote:
I think we're trying to solve the wrong problem here. I need to study this more, but it seems that code that needs to stay within a memory limit only needs to track what's been allocated in chunks within a block, since writing there is what invokes a page fault.
Right. I guess we've discussed what we use for calculating the *used*
memory amount but I don't remember.I think I was confused by the fact that we use some different
approaches to calculate the amount of used memory. Parallel hash and
tidbitmap use the allocated chunk size whereas hash_agg_check_limits()
in nodeAgg.c uses MemoryContextMemAllocated(), which uses the
allocated block size.That's good to know. The latter says:
* After adding a new group to the hash table, check whether we need to enter
* spill mode. Allocations may happen without adding new groups (for instance,
* if the transition state size grows), so this check is imperfect.I'm willing to claim that vacuum can be imperfect also, given the tid store's properties: 1) on average much more efficient in used space, and 2) no longer bound by the 1GB limit.
I'm not sure how this affects progress reporting, because it would be nice if it didn't report dead_tuple_bytes bigger than max_dead_tuple_bytes.
Yes, the progress reporting could be confusable. Particularly, in
shared tidstore cases, the dead_tuple_bytes could be much bigger than
max_dead_tuple_bytes. Probably what we need might be functions for
MemoryContext and dsa_area to get the amount of memory that has been
allocated, by not tracking every chunk space. For example, the
functions would be like what SlabStats() does; iterate over every
block and calculates the total/free memory usage.I'm not sure we need to invent new infrastructure for this. Looking at v29 in vacuumlazy.c, the order of operations for memory accounting is:
First, get the block-level space -- stop and vacuum indexes if we exceed the limit:
/*
* Consider if we definitely have enough space to process TIDs on page
* already. If we are close to overrunning the available space for
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
if (TidStoreIsFull(vacrel->dead_items)) --> which is basically "if (TidStoreMemoryUsage(ts) > ts->control->max_bytes)"Then, after pruning the current page, store the tids and then get the block-level space again:
else if (prunestate.num_offsets > 0)
{
/* Save details of the LP_DEAD items from the page in dead_items */
TidStoreSetBlockOffsets(...);pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
TidStoreMemoryUsage(dead_items));
}Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
Thoughts?
It looks to work but it still doesn't work in a case where a shared
tidstore is created with a 64kB memory limit, right?
TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
from the beginning.
BTW I realized that since the caller can pass dsa_area to tidstore
(and radix tree), if other data are allocated in the same DSA area,
TidStoreMemoryUsage() (and RT_MEMORY_USAGE()) returns the memory usage
that includes not only itself but also other data. Probably it's
better to comment that the passed dsa_area should be dedicated to a
tidstore (or a radix tree).
But now that I'm looking more closely at the details of memory accounting, I don't like that TidStoreMemoryUsage() is called twice per page pruned (see above). Maybe it wouldn't noticeably slow things down, but it's a bit sloppy. It seems like we should call it once per loop and save the result somewhere. If that's the right way to go, that possibly indicates that TidStoreIsFull() is not a useful interface, at least in this form.
Agreed.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 6, 2023 at 1:28 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
Since the block-level measurement is likely overestimating quite a bit,
I propose to simply reverse the order of the actions here, effectively
reporting progress for the *last page* and not the current one: First
update progress with the current memory usage, then add tids for this page.
If this allocated a new block, only a small bit of that will be written to.
If this block pushes it over the limit, we will detect that up at the top
of the loop. It's kind of like our earlier attempts at a "fudge factor",
but simpler and less brittle. And, as far as OS pages we have actually
written to, I think it'll effectively respect the memory limit, at least in
the local mem case. And the numbers will make sense.
Thoughts?
It looks to work but it still doesn't work in a case where a shared
tidstore is created with a 64kB memory limit, right?
TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
from the beginning.
I have two ideas:
1. Make it optional to track chunk memory space by a template parameter. It
might be tiny compared to everything else that vacuum does. That would
allow other users to avoid that overhead.
2. When context block usage exceeds the limit (rare), make the additional
effort to get the precise usage -- I'm not sure such a top-down facility
exists, and I'm not feeling well enough today to study this further.
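For idea (1), a minimal self-contained sketch of the opt-in pattern under discussion; the macro and function names here are illustrative only, not the eventual radix tree template parameter:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Minimal sketch of idea (1): gate the memory accounting behind an
 * opt-in compile-time flag, in the spirit of PostgreSQL's template
 * headers.  Build with -DMEASURE_MEMORY_USAGE to get the bookkeeping;
 * without it, the counter and its accessor do not exist at all.
 */
#ifdef MEASURE_MEMORY_USAGE
static uint64_t mem_used;

static uint64_t
memory_usage(void)
{
	return mem_used;
}
#endif

static void *
tracked_alloc(size_t size)
{
#ifdef MEASURE_MEMORY_USAGE
	mem_used += size;			/* the only cost of opting in */
#endif
	return malloc(size);
}

int
main(void)
{
	for (int i = 0; i < 1000; i++)
		free(tracked_alloc(64));

#ifdef MEASURE_MEMORY_USAGE
	printf("allocated %llu bytes in total\n", (unsigned long long) memory_usage());
#else
	printf("memory tracking compiled out\n");
#endif
	return 0;
}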
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Mar 7, 2023 at 1:01 AM John Naylor <john.naylor@enterprisedb.com> wrote:
On Mon, Mar 6, 2023 at 1:28 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
Thoughts?
It looks to work but it still doesn't work in a case where a shared
tidstore is created with a 64kB memory limit, right?
TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
from the beginning.I have two ideas:
1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.
I prefer option (1) as it's straightforward. I mentioned a similar
idea before[1]/messages/by-id/CAD21AoDK3gbX-jVxT6Pfso1Na0Krzr8Q15498Aj6tmXgzMFksA@mail.gmail.com. RT_MEMORY_USAGE() is defined only when the macro is
defined. It might be worth checking if there is visible overhead of
tracking chunk memory space. IIRC we've not evaluated it yet.
[1]: /messages/by-id/CAD21AoDK3gbX-jVxT6Pfso1Na0Krzr8Q15498Aj6tmXgzMFksA@mail.gmail.com
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Tue, Mar 7, 2023 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
1. Make it optional to track chunk memory space by a template
parameter. It might be tiny compared to everything else that vacuum does.
That would allow other users to avoid that overhead.
2. When context block usage exceeds the limit (rare), make the
additional effort to get the precise usage -- I'm not sure such a top-down
facility exists, and I'm not feeling well enough today to study this
further.
I prefer option (1) as it's straight forward. I mentioned a similar
idea before[1]. RT_MEMORY_USAGE() is defined only when the macro is
defined. It might be worth checking if there is visible overhead of
tracking chunk memory space. IIRC we've not evaluated it yet.
Ok, let's try this -- I can test and profile later this week.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Mar 8, 2023 at 1:40 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Tue, Mar 7, 2023 at 8:25 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.I prefer option (1) as it's straight forward. I mentioned a similar
idea before[1]. RT_MEMORY_USAGE() is defined only when the macro is
defined. It might be worth checking if there is visible overhead of
tracking chunk memory space. IIRC we've not evaluated it yet.Ok, let's try this -- I can test and profile later this week.
Thanks!
I've attached the new version patches. I merged improvements and fixes
I did in the v29 patch. 0007 through 0010 are updates from v29. The
main change made in v30 is to make the memory measurement and
RT_MEMORY_USAGE() optional, which is done in 0007 patch. The 0008 and
0009 patches are the updates for tidstore and the vacuum integration
patches. Here are results of quick tests (an average of 3 executions):
query: select * from bench_load_random_int(10 * 1000 * 1000)
* w/ RT_MEASURE_MEMORY_USAGE:
mem_allocated | load_ms
---------------+---------
1996512000 | 3305
(1 row)
* w/o RT_MEASURE_MEMORY_USAGE:
mem_allocated | load_ms
---------------+---------
0 | 3258
(1 row)
It seems to be within the noise level, but I agree with making it optional.
Apart from the memory measurement stuff, I've done another todo item
on my list; adding min max classes for node3 and node125. I've done
that in 0010 patch, and here is a quick test result:
query: select * from bench_load_random_int(10 * 1000 * 1000)
* w/ 0010 patch
mem_allocated | load_ms
---------------+---------
1268630080 | 3275
(1 row)
* w/o 0010 patch
mem_allocated | load_ms
---------------+---------
1996512000 | 3214
(1 row)
That's a good improvement on the memory usage, without a noticeable
performance overhead. FYI CLASS_3_MIN has 1 fanout and is 24 bytes in
size, and CLASS_125_MIN has 61 fanouts and is 768 bytes in size.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v30-0008-Remove-the-max-memory-deduction-from-TidStore.patch (application/octet-stream)
From 5e3e7098eb12ec1d7ee546cc8f6e635638f131be Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Mar 2023 15:08:58 +0900
Subject: [PATCH v30 08/11] Remove the max memory deduction from TidStore.
---
src/backend/access/common/tidstore.c | 43 +++++++---------------------
1 file changed, 10 insertions(+), 33 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 2d6f2b3ab9..54e2ef29db 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -82,6 +82,7 @@ typedef uint64 offsetbm;
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
@@ -90,6 +91,7 @@ typedef uint64 offsetbm;
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
@@ -182,39 +184,15 @@ TidStoreCreate(size_t max_bytes, OffsetNumber max_off, dsa_area *area)
ts = palloc0(sizeof(TidStore));
- /*
- * Create the radix tree for the main storage.
- *
- * Memory consumption depends on the number of stored tids, but also on the
- * distribution of them, how the radix tree stores, and the memory management
- * that backed the radix tree. The maximum bytes that a TidStore can
- * use is specified by the max_bytes in TidStoreCreate(). We want the total
- * amount of memory consumption by a TidStore not to exceed the max_bytes.
- *
- * In local TidStore cases, the radix tree uses slab allocators for each kind
- * of node class. The most memory consuming case while adding Tids associated
- * with one page (i.e. during TidStoreSetBlockOffsets()) is that we allocate a new
- * slab block for a new radix tree node, which is approximately 70kB. Therefore,
- * we deduct 70kB from the max_bytes.
- *
- * In shared cases, DSA allocates the memory segments big enough to follow
- * a geometric series that approximately doubles the total DSA size (see
- * make_new_segment() in dsa.c). We simulated the how DSA increases segment
- * size and the simulation revealed, the 75% threshold for the maximum bytes
- * perfectly works in case where the max_bytes is a power-of-2, and the 60%
- * threshold works for other cases.
- */
if (area != NULL)
{
dsa_pointer dp;
- float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
LWTRANCHE_SHARED_TIDSTORE);
dp = dsa_allocate0(area, sizeof(TidStoreControl));
ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes = (size_t) (max_bytes * ratio);
ts->area = area;
ts->control->magic = TIDSTORE_MAGIC;
@@ -225,11 +203,15 @@ TidStoreCreate(size_t max_bytes, OffsetNumber max_off, dsa_area *area)
else
{
ts->tree.local = local_rt_create(CurrentMemoryContext);
-
ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
- ts->control->max_bytes = max_bytes - (70 * 1024);
}
+ /*
+ * max_bytes is forced to be at least 64KB, the current minimum valid value
+ * for the work_mem GUC.
+ */
+ ts->control->max_bytes = Max(64 * 1024L, max_bytes);
+
ts->control->max_off = max_off;
ts->control->max_off_nbits = pg_ceil_log2_32(max_off);
@@ -333,14 +315,8 @@ TidStoreReset(TidStore *ts)
LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
- /*
- * Free the radix tree and return allocated DSA segments to
- * the operating system.
- */
- shared_rt_free(ts->tree.shared);
- dsa_trim(ts->area);
-
/* Recreate the radix tree */
+ shared_rt_free(ts->tree.shared);
ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
LWTRANCHE_SHARED_TIDSTORE);
@@ -354,6 +330,7 @@ TidStoreReset(TidStore *ts)
}
else
{
+ /* Recreate the radix tree */
local_rt_free(ts->tree.local);
ts->tree.local = local_rt_create(CurrentMemoryContext);
--
2.31.1
v30-0011-Revert-building-benchmark-module-for-CI.patch (application/octet-stream)
From 7c16882823a3d5b65f32c0147ff9f59e77500390 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 19:31:34 +0700
Subject: [PATCH v30 11/11] Revert building benchmark module for CI
---
contrib/meson.build | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/contrib/meson.build b/contrib/meson.build
index 421d469f8c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,7 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
-subdir('bench_radix_tree')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
v30-0007-Radix-tree-optionally-tracks-memory-usage-when-R.patch (application/octet-stream)
From d271f527e12d91ea238f1bfef4e88220793fee76 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Mar 2023 15:08:19 +0900
Subject: [PATCH v30 07/11] Radix tree optionally tracks memory usage, when
RT_MEASURE_MEMORY_USAGE.
---
contrib/bench_radix_tree/bench_radix_tree.c | 1 +
src/backend/utils/mmgr/dsa.c | 12 ---
src/include/lib/radixtree.h | 93 +++++++++++++++++--
src/include/utils/dsa.h | 1 -
.../modules/test_radixtree/test_radixtree.c | 1 +
5 files changed, 85 insertions(+), 23 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 63e842395d..fc6e4cb699 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -34,6 +34,7 @@ PG_MODULE_MAGIC;
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE uint64
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 80555aefff..f5a62061a3 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,18 +1024,6 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
-size_t
-dsa_get_total_size(dsa_area *area)
-{
- size_t size;
-
- LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
- size = area->control->total_segment_size;
- LWLockRelease(DSA_AREA_LOCK(area));
-
- return size;
-}
-
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 2e3963c3d5..6d65544dd0 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -84,7 +84,6 @@
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
* RT_ITERATE_NEXT - Return next key-value pair, if any
* RT_END_ITERATE - End iteration
- * RT_MEMORY_USAGE - Get the memory usage
*
* Interface for Shared Memory
* ---------
@@ -97,6 +96,8 @@
* ---------
*
* RT_DELETE - Delete a key-value pair. Declared/define if RT_USE_DELETE is defined
+ * RT_MEMORY_USAGE - Get the memory usage. Declared/define if
+ * RT_MEASURE_MEMORY_USAGE is defined.
*
*
* Copyright (c) 2023, PostgreSQL Global Development Group
@@ -138,7 +139,9 @@
#ifdef RT_USE_DELETE
#define RT_DELETE RT_MAKE_NAME(delete)
#endif
+#ifdef RT_MEASURE_MEMORY_USAGE
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#endif
#ifdef RT_DEBUG
#define RT_DUMP RT_MAKE_NAME(dump)
#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
@@ -150,6 +153,9 @@
#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#ifdef RT_MEASURE_MEMORY_USAGE
+#define RT_FANOUT_GET_NODE_SIZE RT_MAKE_NAME(fanout_get_node_size)
+#endif
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND_UP RT_MAKE_NAME(extend_up)
@@ -255,7 +261,9 @@ RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+#ifdef RT_MEASURE_MEMORY_USAGE
RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+#endif
#ifdef RT_DEBUG
RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
@@ -624,6 +632,10 @@ typedef struct RT_RADIX_TREE_CONTROL
uint64 max_val;
uint64 num_keys;
+#ifdef RT_MEASURE_MEMORY_USAGE
+ int64 mem_used;
+#endif
+
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
@@ -1089,6 +1101,11 @@ RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
allocsize);
#endif
+#ifdef RT_MEASURE_MEMORY_USAGE
+ /* update memory usage */
+ tree->ctl->mem_used += allocsize;
+#endif
+
#ifdef RT_DEBUG
/* update the statistics */
tree->ctl->cnt[size_class]++;
@@ -1165,6 +1182,54 @@ RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL no
return newnode;
}
+#ifdef RT_MEASURE_MEMORY_USAGE
+/* Return the node size of the given fanout of the size class */
+static inline Size
+RT_FANOUT_GET_NODE_SIZE(int fanout, bool is_leaf)
+{
+ const Size fanout_inner_node_size[] = {
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].inner_size,
+ [15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].inner_size,
+ [32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].inner_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].inner_size,
+ [256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].inner_size,
+ };
+ const Size fanout_leaf_node_size[] = {
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].leaf_size,
+ [15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].leaf_size,
+ [32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].leaf_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].leaf_size,
+ [256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].leaf_size,
+ };
+ Size node_size;
+
+ node_size = is_leaf ?
+ fanout_leaf_node_size[fanout] : fanout_inner_node_size[fanout];
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ Size assert_node_size = 0;
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+
+ if (size_class.fanout == fanout)
+ {
+ assert_node_size = is_leaf ?
+ size_class.leaf_size : size_class.inner_size;
+ break;
+ }
+ }
+
+ Assert(node_size == assert_node_size);
+ }
+#endif
+
+ return node_size;
+}
+#endif
+
/* Free the given node */
static void
RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
@@ -1197,11 +1262,22 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
}
#endif
+#ifdef RT_MEASURE_MEMORY_USAGE
+ /* update memory usage */
+ {
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ tree->ctl->mem_used -= RT_FANOUT_GET_NODE_SIZE(node->fanout,
+ RT_NODE_IS_LEAF(node));
+ Assert(tree->ctl->mem_used >= 0);
+ }
+#endif
+
#ifdef RT_SHMEM
dsa_free(tree->dsa, allocnode);
#else
pfree(allocnode);
#endif
+
}
/* Update the parent's pointer when growing a node */
@@ -1989,27 +2065,23 @@ RT_END_ITERATE(RT_ITER *iter)
/*
* Return the statistics of the amount of memory used by the radix tree.
*/
+#ifdef RT_MEASURE_MEMORY_USAGE
RT_SCOPE uint64
RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
Size total = 0;
- RT_LOCK_SHARED(tree);
-
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
- total = dsa_get_total_size(tree->dsa);
-#else
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
- }
#endif
+ RT_LOCK_SHARED(tree);
+ total = tree->ctl->mem_used;
RT_UNLOCK(tree);
+
return total;
}
+#endif
/*
* Verify the radix tree node.
@@ -2476,6 +2548,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_NEW_ROOT
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
+#undef RT_FANOUT_GET_NODE_SIZE
#undef RT_FREE_NODE
#undef RT_FREE_RECURSE
#undef RT_EXTEND_UP
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 2af215484f..3ce4ee300a 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,7 +121,6 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
-extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 5a169854d9..19d286d84b 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -114,6 +114,7 @@ static const test_spec test_specs[] = {
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE TestValueType
/* #define RT_SHMEM */
#include "lib/radixtree.h"
--
2.31.1
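To see the new template switch from the caller's side, here is a minimal sketch of instantiating a backend-local radix tree with memory-usage tracking enabled, modeled on the bench/test modules touched above. Only RT_MEASURE_MEMORY_USAGE, RT_VALUE_TYPE, and the uint64 return type of the memory-usage function are taken from the patch; the RT_PREFIX/RT_SCOPE settings and the rt_create/rt_set/rt_free names are assumptions based on the template's usual naming convention:

#include "postgres.h"

/* sketch only: the rt_* names below are the assumed generated names
 * for RT_PREFIX "rt"; adjust to the actual template options in use */
#define RT_PREFIX rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_MEASURE_MEMORY_USAGE		/* generates rt_memory_usage() */
#define RT_VALUE_TYPE uint64
#include "lib/radixtree.h"

static void
radix_tree_mem_usage_example(void)
{
	rt_radix_tree *tree = rt_create(CurrentMemoryContext);
	uint64		val = 42;

	rt_set(tree, UINT64CONST(1), &val);

	/* with this patch the figure comes from tree->ctl->mem_used,
	 * maintained in RT_ALLOC_NODE/RT_FREE_NODE, rather than from the
	 * slab contexts or the total DSA segment size */
	elog(NOTICE, "radix tree uses " UINT64_FORMAT " bytes",
		 rt_memory_usage(tree));

	rt_free(tree);
}

The point of gating RT_MEMORY_USAGE behind RT_MEASURE_MEMORY_USAGE is that callers which never ask for the figure don't pay for the mem_used bookkeeping on every node allocation and free.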
Attachment: v30-0009-Revert-the-update-for-the-minimum-value-of-maint.patch (application/octet-stream)
From f7013c9023ff3f9a6707276303443f0b4e00ccbf Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Mar 2023 15:09:22 +0900
Subject: [PATCH v30 09/11] Revert the update for the minimum value of
maintenance_work_mem.
---
src/backend/postmaster/autovacuum.c | 6 +++---
src/backend/utils/misc/guc_tables.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index a371f6fbba..ff6149a179 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3397,12 +3397,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 2MB. Since
+ * We clamp manually-set values to at least 1MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 2048)
- *newval = 2048;
+ if (*newval < 1024)
+ *newval = 1024;
return true;
}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8a64614cd1..1c0583fe26 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2313,7 +2313,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 2048, MAX_KILOBYTES,
+ 65536, 1024, MAX_KILOBYTES,
NULL, NULL, NULL
},
--
2.31.1
Attachment: v30-0010-Add-min-and-max-classes-for-node3-and-node125.patch (application/octet-stream)
From ba41d3bfcf0d3016c61948ce6acc0d9582d8aad8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 9 Mar 2023 11:42:17 +0900
Subject: [PATCH v30 10/11] Add min and max classes for node3 and node125.
---
src/include/lib/radixtree.h | 70 +++++++++++++------
src/include/lib/radixtree_insert_impl.h | 56 ++++++++++++++-
.../expected/test_radixtree.out | 4 ++
.../modules/test_radixtree/test_radixtree.c | 6 +-
4 files changed, 110 insertions(+), 26 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 6d65544dd0..b655f4a2a2 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -225,10 +225,12 @@
#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
-#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_3_MIN RT_MAKE_NAME(class_3_min)
+#define RT_CLASS_3_MAX RT_MAKE_NAME(class_3_max)
#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
-#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_125_MIN RT_MAKE_NAME(class_125_min)
+#define RT_CLASS_125_MAX RT_MAKE_NAME(class_125_max)
#define RT_CLASS_256 RT_MAKE_NAME(class_256)
/* generate forward declarations necessary to use the radix tree */
@@ -561,10 +563,12 @@ typedef struct RT_NODE_LEAF_256
*/
typedef enum RT_SIZE_CLASS
{
- RT_CLASS_3 = 0,
+ RT_CLASS_3_MIN = 0,
+ RT_CLASS_3_MAX,
RT_CLASS_32_MIN,
RT_CLASS_32_MAX,
- RT_CLASS_125,
+ RT_CLASS_125_MIN,
+ RT_CLASS_125_MAX,
RT_CLASS_256
} RT_SIZE_CLASS;
@@ -580,7 +584,13 @@ typedef struct RT_SIZE_CLASS_ELEM
} RT_SIZE_CLASS_ELEM;
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
- [RT_CLASS_3] = {
+ [RT_CLASS_3_MIN] = {
+ .name = "radix tree node 1",
+ .fanout = 1,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 1 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 1 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_3_MAX] = {
.name = "radix tree node 3",
.fanout = 3,
.inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
@@ -598,7 +608,13 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_125] = {
+ [RT_CLASS_125_MIN] = {
+ .name = "radix tree node 125",
+ .fanout = 61,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 61 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 61 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125_MAX] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
@@ -934,7 +950,7 @@ static inline void
RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
@@ -946,7 +962,7 @@ static inline void
RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
@@ -1152,9 +1168,9 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_MIN, is_leaf);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3_MIN, is_leaf);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1188,17 +1204,21 @@ static inline Size
RT_FANOUT_GET_NODE_SIZE(int fanout, bool is_leaf)
{
const Size fanout_inner_node_size[] = {
- [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].inner_size,
+ [1] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MIN].inner_size,
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].inner_size,
[15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].inner_size,
[32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].inner_size,
- [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].inner_size,
+ [61] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MIN].inner_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MAX].inner_size,
[256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].inner_size,
};
const Size fanout_leaf_node_size[] = {
- [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].leaf_size,
+ [1] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MIN].leaf_size,
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].leaf_size,
[15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].leaf_size,
[32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].leaf_size,
- [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].leaf_size,
+ [61] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MIN].leaf_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MAX].leaf_size,
[256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].leaf_size,
};
Size node_size;
@@ -1337,9 +1357,9 @@ RT_EXTEND_UP(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_LOCAL node;
RT_NODE_INNER_3 *n3;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, true);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_MIN, true);
node = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, true);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3_MIN, true);
node->shift = shift;
node->count = 1;
@@ -1375,9 +1395,9 @@ RT_EXTEND_DOWN(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_L
int newshift = shift - RT_NODE_SPAN;
bool is_leaf = newshift == 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3_MIN, is_leaf);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3_MIN, is_leaf);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
@@ -2177,12 +2197,14 @@ RT_STATS(RT_RADIX_TREE *tree)
{
RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
- fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ fprintf(stderr, "height = %d, n1 = %u, n3 = %u, n15 = %u, n32 = %u, n61 = %u, n125 = %u, n256 = %u\n",
root->shift / RT_NODE_SPAN,
- tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_3_MIN],
+ tree->ctl->cnt[RT_CLASS_3_MAX],
tree->ctl->cnt[RT_CLASS_32_MIN],
tree->ctl->cnt[RT_CLASS_32_MAX],
- tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_125_MIN],
+ tree->ctl->cnt[RT_CLASS_125_MAX],
tree->ctl->cnt[RT_CLASS_256]);
}
@@ -2519,10 +2541,12 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_SIZE_CLASS
#undef RT_SIZE_CLASS_ELEM
#undef RT_SIZE_CLASS_INFO
-#undef RT_CLASS_3
+#undef RT_CLASS_3_MIN
+#undef RT_CLASS_3_MAX
#undef RT_CLASS_32_MIN
#undef RT_CLASS_32_MAX
-#undef RT_CLASS_125
+#undef RT_CLASS_125_MIN
+#undef RT_CLASS_125_MAX
#undef RT_CLASS_256
/* function declarations */
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index d56e58dcac..d10093dfba 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -42,6 +42,7 @@
{
case RT_NODE_KIND_3:
{
+ const RT_SIZE_CLASS_ELEM class3_max = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX];
RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
#ifdef RT_NODE_LEVEL_LEAF
@@ -55,6 +56,32 @@
break;
}
#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)) &&
+ n3->base.n.fanout < class3_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class3_min = RT_SIZE_CLASS_INFO[RT_CLASS_3_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_3_MAX;
+
+ Assert(n3->base.n.fanout == class3_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n3 = (RT_NODE3_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class3_min.leaf_size);
+#else
+ memcpy(newnode, node, class3_min.inner_size);
+#endif
+ newnode->fanout = class3_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
if (unlikely(RT_NODE_MUST_GROW(n3)))
{
RT_PTR_ALLOC allocnode;
@@ -154,7 +181,7 @@
RT_PTR_LOCAL newnode;
RT_NODE125_TYPE *new125;
const uint8 new_kind = RT_NODE_KIND_125;
- const RT_SIZE_CLASS new_class = RT_CLASS_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125_MIN;
Assert(n32->base.n.fanout == class32_max.fanout);
@@ -213,6 +240,7 @@
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
+ const RT_SIZE_CLASS_ELEM class125_max = RT_SIZE_CLASS_INFO[RT_CLASS_125_MAX];
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
int slotpos;
int cnt = 0;
@@ -227,6 +255,32 @@
break;
}
#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)) &&
+ n125->base.n.fanout < class125_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class125_min = RT_SIZE_CLASS_INFO[RT_CLASS_125_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_125_MAX;
+
+ Assert(n125->base.n.fanout == class125_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n125 = (RT_NODE125_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class125_min.leaf_size);
+#else
+ memcpy(newnode, node, class125_min.inner_size);
+#endif
+ newnode->fanout = class125_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
if (unlikely(RT_NODE_MUST_GROW(n125)))
{
RT_PTR_ALLOC allocnode;
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index 7ad1ce3605..f2b1d7e4f8 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -4,12 +4,16 @@ CREATE EXTENSION test_radixtree;
-- an error if something fails.
--
SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 1
+NOTICE: testing basic operations with inner node 1
NOTICE: testing basic operations with leaf node 3
NOTICE: testing basic operations with inner node 3
NOTICE: testing basic operations with leaf node 15
NOTICE: testing basic operations with inner node 15
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 61
+NOTICE: testing basic operations with inner node 61
NOTICE: testing basic operations with leaf node 125
NOTICE: testing basic operations with inner node 125
NOTICE: testing basic operations with leaf node 256
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 19d286d84b..4f38b6e3de 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -47,10 +47,12 @@ static const bool rt_test_stats = false;
* XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
*/
static int rt_node_class_fanouts[] = {
- 3, /* RT_CLASS_3 */
+ 1, /* RT_CLASS_3_MIN */
+ 3, /* RT_CLASS_3_MAX */
15, /* RT_CLASS_32_MIN */
32, /* RT_CLASS_32_MAX */
- 125, /* RT_CLASS_125 */
+ 61, /* RT_CLASS_125_MIN */
+ 125, /* RT_CLASS_125_MAX */
256 /* RT_CLASS_256 */
};
/*
--
2.31.1
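For readers not following the full diff, the gist is that a node can now grow within its own kind before switching to the next kind: new root and extension nodes start at the _MIN classes, and when a _MIN node must grow it is first copied into the _MAX class of the same kind. A condensed sketch of that step and of the resulting size-class progression, with declarations elided for brevity (this is a summary of the insert path above, not a verbatim excerpt):

/* size-class progression as entries are added, per the fanouts above:
 *
 *   kind 3:    fanout 1  -> fanout 3     (copy within the same kind)
 *   kind 32:   fanout 15 -> fanout 32
 *   kind 125:  fanout 61 -> fanout 125
 *   kind 256:  fanout 256                (no growth needed)
 *
 * growing within a kind is "allocate the larger class, memcpy the
 * smaller class's size, bump the fanout, swap into the parent": */
if (RT_NODE_MUST_GROW(n3) && n3->base.n.fanout < class3_max.fanout)
{
	allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_MAX, is_leaf);
	newnode = RT_PTR_GET_LOCAL(tree, allocnode);
	memcpy(newnode, node, is_leaf ? class3_min.leaf_size
								  : class3_min.inner_size);
	newnode->fanout = class3_max.fanout;
	RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
	node = newnode;
}

Only once a node is already at its kind's _MAX class does the existing kind-switching path run, so sparse trees keep the smaller allocations for as long as possible.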
Attachment: v30-0006-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch (application/octet-stream)
From b4e4ea5f22ee8898fa7ef58a21d0da1d4d661a0a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 7 Feb 2023 17:19:29 +0700
Subject: [PATCH v30 06/11] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space-efficient and was slow to look up. It also
had a hard 1GB limit on its size.
Now we use TIDStore to store dead tuple TIDs. Since the TIDStore,
backed by the radix tree, allocates memory incrementally, we get rid
of the 1GB limit.
Since we can no longer estimate exactly the maximum number of TIDs
that can be stored, pg_stat_progress_vacuum now reports progress based
on the amount of memory used, in bytes. The columns are accordingly
renamed to max_dead_tuple_bytes and dead_tuple_bytes.
In addition, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by the TIDStore is 1MB, the initial
DSA segment size. Due to that, we increase the minimum value of
maintenance_work_mem (and autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
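A minimal sketch of how lazy vacuum drives the new API after this patch; the TidStore call names come from the call sites in the diff below, while the wrapper function, its arguments, and the exact signatures are inferred and only illustrative:

#include "postgres.h"
#include "access/htup_details.h"
#include "access/tidstore.h"
#include "miscadmin.h"

/* sketch only: first pass collects offsets, the index pass does
 * membership checks, the second pass iterates block by block, and the
 * store is reset before the heap scan resumes */
static void
tidstore_usage_sketch(BlockNumber blkno, OffsetNumber *deadoffsets,
					  int num_offsets, ItemPointer itemptr)
{
	/* passing NULL for the dsa_area gives a backend-local store */
	TidStore   *dead_items = TidStoreCreate(maintenance_work_mem * 1024L,
											MaxHeapTuplesPerPage, NULL);
	TidStoreIter *iter;
	TidStoreIterResult *res;

	/* first heap pass: remember this page's LP_DEAD offsets */
	TidStoreSetBlockOffsets(dead_items, blkno, deadoffsets, num_offsets);

	/* index vacuuming: vac_tid_reaped() becomes a membership test */
	if (TidStoreIsMember(dead_items, itemptr))
	{
		/* the index tuple points at a dead heap tuple: delete it */
	}

	/* second heap pass: visit the collected offsets block by block;
	 * in vacuumlazy.c this drives lazy_vacuum_heap_page() */
	iter = TidStoreBeginIterate(dead_items);
	while ((res = TidStoreIterateNext(iter)) != NULL)
	{
		/* res->blkno, res->offsets, res->num_offsets */
	}
	TidStoreEndIterate(iter);

	/* forget all collected TIDs before resuming the heap scan */
	TidStoreReset(dead_items);
	TidStoreDestroy(dead_items);
}

For a parallel vacuum the only difference, as in the vacuumparallel.c changes below, is that the store is created on a DSA area and workers attach to it through the handle stored in PVShared.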
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 279 ++++++++-------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-----
src/backend/commands/vacuumparallel.c | 66 +++--
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 174 insertions(+), 311 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 97d588b1d8..61e163636a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7170,10 +7170,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -7181,10 +7181,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..edb9079124 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3,18 +3,17 @@
* vacuumlazy.c
* Concurrent ("lazy") vacuuming.
*
- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TidStore, a storage for dead TIDs
* that are to be removed from indexes. We want to ensure we can vacuum even
* the very largest relations with finite memory space usage. To do that, we
- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
- * autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables). If the array threatens to overflow, we must call
- * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
- * This frees up the memory space dedicated to storing dead TIDs.
+ * autovacuum_work_mem) memory space to keep track of dead TIDs. If the
+ * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum
+ * the pages that we've pruned). This frees up the memory space dedicated
+ * to storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -40,6 +39,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +188,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,11 +220,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -259,8 +262,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -487,11 +491,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/*
- * Allocate dead_items array memory using dead_items_alloc. This handles
- * parallel VACUUM initialization as part of allocating shared memory
- * space used for dead_items. (But do a failsafe precheck first, to
- * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
- * is already dangerously old.)
+ * Allocate dead_items memory using dead_items_alloc. This handles parallel
+ * VACUUM initialization as part of allocating shared memory space used for
+ * dead_items. (But do a failsafe precheck first, to ensure that parallel
+ * VACUUM won't be attempted at all when relfrozenxid is already dangerously
+ * old.)
*/
lazy_check_wraparound_failsafe(vacrel);
dead_items_alloc(vacrel, params->nworkers);
@@ -797,7 +801,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* have collected the TIDs whose index tuples need to be removed.
*
* Finally, invokes lazy_vacuum_heap_rel to vacuum heap pages, which
- * largely consists of marking LP_DEAD items (from collected TID array)
+ * largely consists of marking LP_DEAD items (from vacrel->dead_items)
* as LP_UNUSED. This has to happen in a second, final pass over the
* heap, to preserve a basic invariant that all index AMs rely on: no
* extant index tuple can ever be allowed to contain a TID that points to
@@ -825,21 +829,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = TidStoreMaxMemory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +910,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (TidStoreIsFull(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -969,7 +972,7 @@ lazy_scan_heap(LVRelState *vacrel)
continue;
}
- /* Collect LP_DEAD items in dead_items array, count tuples */
+ /* Collect LP_DEAD items in dead_items, count tuples */
if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
&recordfreespace))
{
@@ -1011,14 +1014,14 @@ lazy_scan_heap(LVRelState *vacrel)
* Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
- * dead_items array. This includes LP_DEAD line pointers that we
- * pruned ourselves, as well as existing LP_DEAD line pointers that
- * were pruned some time earlier. Also considers freezing XIDs in the
- * tuple headers of remaining items with storage.
+ * dead_items. This includes LP_DEAD line pointers that we pruned
+ * ourselves, as well as existing LP_DEAD line pointers that were pruned
+ * some time earlier. Also considers freezing XIDs in the tuple headers
+ * of remaining items with storage.
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1037,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1079,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(TidStoreNumTids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page in dead_items */
+ TidStoreSetBlockOffsets(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ TidStoreMemoryUsage(dead_items));
}
/*
@@ -1145,7 +1155,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1203,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1259,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (TidStoreNumTids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1524,9 +1534,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
* The approach we take now is to restart pruning when the race condition is
* detected. This allows heap_page_prune() to prune the tuples inserted by
* the now-aborted transaction. This is a little crude, but it guarantees
- * that any items that make it into the dead_items array are simple LP_DEAD
- * line pointers, and that every remaining item with tuple storage is
- * considered as a candidate for freezing.
+ * that any items that make it into the dead_items are simple LP_DEAD line
+ * pointers, and that every remaining item with tuple storage is considered
+ * as a candidate for freezing.
*/
static void
lazy_scan_prune(LVRelState *vacrel,
@@ -1543,13 +1553,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1579,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1587,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->deadoffsets; prunestate->deadoffsets's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1600,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1645,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1875,7 +1882,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1895,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1916,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -1940,7 +1928,7 @@ retry:
* lazy_scan_prune, which requires a full cleanup lock. While pruning isn't
* performed here, it's quite possible that an earlier opportunistic pruning
* operation left LP_DEAD items behind. We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items for removal from indexes.
*
* For aggressive VACUUM callers, we may return false to indicate that a full
* cleanup lock is required for processing by lazy_scan_prune. This is only
@@ -2099,7 +2087,7 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
vacrel->NewRelminMxid = NoFreezePageRelminMxid;
- /* Save any LP_DEAD items found on the page in dead_items array */
+ /* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
@@ -2129,8 +2117,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2126,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ TidStoreSetBlockOffsets(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ TidStoreMemoryUsage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2178,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ TidStoreReset(vacrel->dead_items);
return;
}
@@ -2227,7 +2207,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2234,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2280,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ TidStoreReset(vacrel->dead_items);
}
/*
@@ -2373,7 +2353,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2392,9 +2372,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
/*
* lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
*
- * This routine marks LP_DEAD items in vacrel->dead_items array as LP_UNUSED.
- * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
- * at all.
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
*
* We may also be able to truncate the line pointer array of the heap pages we
* visit. If there is a contiguous group of LP_UNUSED items at the end of the
@@ -2410,10 +2389,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2408,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = TidStoreBeginIterate(vacrel->dead_items);
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2418,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = iter_result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2432,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, iter_result->offsets,
+ iter_result->num_offsets, buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2443,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ TidStoreEndIterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,36 +2453,31 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " INT64_FORMAT " dead item identifiers in %u pages",
+ vacrel->relname, TidStoreNumTids(vacrel->dead_items),
+ vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2496,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2570,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -2687,8 +2659,8 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
* lazy_vacuum_one_index() -- vacuum index relation.
*
* Delete all the index tuples containing a TID collected in
- * vacrel->dead_items array. Also update running statistics.
- * Exact details depend on index AM's ambulkdelete routine.
+ * vacrel->dead_items. Also update running statistics. Exact
+ * details depend on index AM's ambulkdelete routine.
*
* reltuples is the number of heap tuples to be passed to the
* bulkdelete callback. It's always assumed to be estimated.
@@ -3094,48 +3066,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
}
/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
-/*
- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
@@ -3143,11 +3075,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3104,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3117,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = TidStoreCreate(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..149d41b41c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index aa79d9de4d..5fb30d7e62 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2303,16 +2302,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ TidStoreNumTids(dead_items))));
return istat;
}
@@ -2343,82 +2342,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch(itemptr,
- dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return TidStoreIsMember(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..9225daf3ab 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,10 +9,10 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSM segment. We
+ * the memory space for storing dead items allocated in the DSA area. We
* launch parallel worker processes at the start of parallel index
* bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
+ * parallel worker processes exit. Each time we process indexes in parallel,
* the parallel context is re-initialized so that the same DSM can be used for
* multiple passes of index bulk-deletion and index cleanup.
*
@@ -103,6 +103,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ TidStoreHandle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +169,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +226,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +290,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Initial size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +357,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = TidStoreCreate(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = TidStoreGetHandle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +385,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +442,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ TidStoreDestroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +453,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +951,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +997,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = TidStoreAttach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1046,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ TidStoreDetach(dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index ff6149a179..a371f6fbba 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3397,12 +3397,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..8a64614cd1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2313,7 +2313,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 689dbb7702..a3ebb169ef 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -276,21 +277,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -339,18 +325,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index acfd9d1f4f..d320ad87dd 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e953d1f515..ef46c2994f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2032,8 +2032,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index d49ce9f300..d6e2471b00 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
Attachment: v30-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
From 95bb8dc701efa4a5923a355880b60885dc18cfa3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v30 04/11] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
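
For readers following along, here is a minimal standalone sketch (not part
of the patch) of the key/value encoding described above, assuming 8kB heap
pages, i.e. 9 bits for offset numbers; the constant and function names are
illustrative only:

    #include <stdint.h>

    /* Illustrative constants; the patch derives these from max_off at runtime. */
    #define MAX_OFF_NBITS      9   /* ceil(log2(MaxHeapTuplesPerPage)) on 8kB pages */
    #define LOWER_OFFSET_NBITS 6   /* 2^6 = 64 bit positions in a 64-bit value */
    #define LOWER_OFFSET_MASK  ((1 << LOWER_OFFSET_NBITS) - 1)

    /* Encode (block, offset) into a radix tree key and a single bit in the value. */
    static void
    encode_tid_sketch(uint32_t block, uint16_t offset,
                      uint64_t *key, uint64_t *value_bit)
    {
        /* Combine block and offset into one integer, block in the upper bits. */
        uint64_t compressed = ((uint64_t) block << MAX_OFF_NBITS) | offset;

        /* The lowest 6 bits of the offset select a bit within the 64-bit value... */
        *value_bit = UINT64_C(1) << (offset & LOWER_OFFSET_MASK);

        /* ...and the remaining upper bits form the key. */
        *key = compressed >> LOWER_OFFSET_NBITS;
    }

With this scheme, up to 64 nearby offsets share one key, so a heap page's
dead tids typically occupy only a handful of key/value pairs in the radix
tree.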
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 710 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 50 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 228 ++++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1089 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d0..97d588b1d8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2203,6 +2203,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..2d6f2b3ab9
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,710 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * TidStore is an in-memory data structure for storing tids (ItemPointerData).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
+ * and stored in the radix tree.
+ *
+ * TidStore can be shared among parallel worker processes by passing a DSA
+ * area to TidStoreCreate(). Other backends can attach to the shared TidStore
+ * with TidStoreAttach().
+ *
+ * For concurrency support, we use a single LWLock for the TidStore. The
+ * TidStore is locked exclusively when inserting encoded tids into the
+ * radix tree or when resetting itself. When searching the TidStore or
+ * iterating over it, the TidStore itself is not locked, but the underlying
+ * radix tree is locked in shared mode.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, a tid is represented as a pair of 64-bit key and
+ * 64-bit value.
+ *
+ * First, we construct a 64-bit unsigned integer by combining the block
+ * number and the offset number. The number of bits used for the offset number
+ * is specified by max_off in TidStoreCreate(). We are frugal with the bits,
+ * because smaller keys could help keep the radix tree shallow.
+ *
+ * For example, a tid of heap on a 8kB block uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. 9 bits
+ * are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks. That is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * Then, the 64-bit value is the bitmap representation of the lowest 6 bits
+ * (LOWER_OFFSET_NBITS) of the integer, and the 64-bit key consists of the
+ * upper 3 bits of the offset number and the block number, 35 bits in
+ * total:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |--------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ *
+ * If the number of bits required for offset numbers fits in LOWER_OFFSET_NBITS,
+ * the 64-bit value is the bitmap representation of the offset number, and the
+ * 64-bit key is the block number.
+ */
+typedef uint64 tidkey;
+typedef uint64 offsetbm;
+#define LOWER_OFFSET_NBITS 6 /* log2(sizeof(offsetbm) * BITS_PER_BYTE) */
+#define LOWER_OFFSET_MASK ((1 << LOWER_OFFSET_NBITS) - 1)
+
+/* A magic value used to identify our TidStore. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE tidkey
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE tidkey
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ OffsetNumber max_off; /* the maximum offset number */
+ int max_off_nbits; /* the number of bits required for offset
+ * numbers */
+ int upper_off_nbits; /* the number of bits of offset numbers
+ * used in a key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ TidStoreHandle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ tidkey next_tidkey;
+ offsetbm next_off_bitmap;
+
+ /*
+ * output for the caller. Must be last because variable-size.
+ */
+ TidStoreIterResult output;
+} TidStoreIter;
+
+static void iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap);
+static inline BlockNumber key_get_blkno(TidStore *ts, tidkey key);
+static inline tidkey encode_blk_off(TidStore *ts, BlockNumber block,
+ OffsetNumber offset, offsetbm *off_bit);
+static inline tidkey encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+TidStoreCreate(size_t max_bytes, OffsetNumber max_off, dsa_area *area)
+{
+ TidStore *ts;
+
+ Assert(max_off <= MaxOffsetNumber);
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends on the number of stored tids, but also on their
+ * distribution, on how the radix tree stores them, and on the memory
+ * management that backs the radix tree. The maximum number of bytes that a
+ * TidStore may use is specified by max_bytes in TidStoreCreate(). We want
+ * the total memory consumption of a TidStore not to exceed max_bytes.
+ *
+ * In the local TidStore case, the radix tree uses a slab allocator for each
+ * node class. The most memory-consuming case while adding tids associated
+ * with one page (i.e. during TidStoreSetBlockOffsets()) is allocating a new
+ * slab block for a new radix tree node, which is approximately 70kB.
+ * Therefore, we deduct 70kB from max_bytes.
+ *
+ * In the shared case, DSA allocates memory segments following a geometric
+ * series that approximately doubles the total DSA size (see
+ * make_new_segment() in dsa.c). We simulated how DSA increases the segment
+ * size, and the simulation showed that a 75% threshold for the maximum bytes
+ * works well when max_bytes is a power of two, and a 60% threshold works for
+ * other cases.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (size_t) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_off = max_off;
+ ts->control->max_off_nbits = pg_ceil_log2_32(max_off);
+
+ if (ts->control->max_off_nbits < LOWER_OFFSET_NBITS)
+ ts->control->max_off_nbits = LOWER_OFFSET_NBITS;
+
+ ts->control->upper_off_nbits =
+ ts->control->max_off_nbits - LOWER_OFFSET_NBITS;
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+TidStoreAttach(dsa_area *area, TidStoreHandle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+TidStoreDetach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call TidStoreDetach() to free up backend-local memory associated
+ * with the TidStore. The backend that calls TidStoreDestroy() must not call
+ * TidStoreDetach().
+ */
+void
+TidStoreDestroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected tids. This is similar to TidStoreDestroy(), but instead
+ * of freeing the entire TidStore we recreate only the radix tree storage.
+ */
+void
+TidStoreReset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/*
+ * Set the given tids on block 'blkno' in the TidStore.
+ *
+ * NB: the offset numbers in offsets must be sorted in ascending order.
+ */
+void
+TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ offsetbm *bitmaps;
+ tidkey key;
+ tidkey prev_key;
+ offsetbm off_bitmap = 0;
+ int idx;
+ const tidkey key_base = ((uint64) blkno) << ts->control->upper_off_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->upper_off_nbits;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+ Assert(BlockNumberIsValid(blkno));
+
+ bitmaps = palloc(sizeof(offsetbm) * nkeys);
+ key = prev_key = key_base;
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ offsetbm off_bit;
+
+ Assert(offsets[i] <= ts->control->max_off);
+
+ /* encode the tid to a key and partial offset */
+ key = encode_blk_off(ts, blkno, offsets[i], &off_bit);
+
+ /* make sure we scanned the line pointer array in order */
+ Assert(key >= prev_key);
+
+ if (key > prev_key)
+ {
+ idx = prev_key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ /* write out offset bitmap for this key */
+ bitmaps[idx] = off_bitmap;
+
+ /* zero out any gaps up to the current key */
+ for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
+ bitmaps[empty_idx] = 0;
+
+ /* reset for current key -- the current offset will be handled below */
+ off_bitmap = 0;
+ prev_key = key;
+ }
+
+ off_bitmap |= off_bit;
+ }
+
+ /* save the final index for later */
+ idx = key - key_base;
+ /* write out last offset bitmap */
+ bitmaps[idx] = off_bitmap;
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i <= idx; i++)
+ {
+ if (bitmaps[i])
+ {
+ key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &bitmaps[i]);
+ else
+ local_rt_set(ts->tree.local, key, &bitmaps[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(bitmaps);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+TidStoreIsMember(TidStore *ts, ItemPointer tid)
+{
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ offsetbm off_bit;
+ bool found;
+
+ Assert(ItemPointerIsValid(tid));
+
+ key = encode_tid(ts, tid, &off_bit);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &off_bitmap);
+ else
+ found = local_rt_search(ts->tree.local, key, &off_bitmap);
+
+ if (!found)
+ return false;
+
+ return (off_bitmap & off_bit) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, TidStoreEndIterate() needs to be called when finished.
+ *
+ * The TidStoreIter struct is created in the caller's memory context.
+ *
+ * Concurrent inserts of key-value pairs into the radix tree are blocked
+ * while the iteration is in progress.
+ */
+TidStoreIter *
+TidStoreBeginIterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter) +
+ sizeof(OffsetNumber) * ts->control->max_off);
+ iter->ts = ts;
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to do */
+ if (TidStoreNumTids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter(TidStoreIter *iter, tidkey *key, offsetbm *off_bitmap)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, off_bitmap);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, off_bitmap);
+}
+
+/*
+ * Scan the TidStore and return a pointer to a TidStoreIterResult that has the
+ * tids of one block. Block numbers are returned in ascending order, and the
+ * offset numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+TidStoreIterateNext(TidStoreIter *iter)
+{
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ TidStoreIterResult *output = &(iter->output);
+
+ if (iter->finished)
+ return NULL;
+
+ /* Initialize the outputs */
+ output->blkno = InvalidBlockNumber;
+ output->num_offsets = 0;
+
+ /*
+ * Decode the key and offset bitmap collected in the previous iteration,
+ * if any.
+ */
+ if (iter->next_off_bitmap > 0)
+ iter_decode_key_off(iter, iter->next_tidkey, iter->next_off_bitmap);
+
+ while (tidstore_iter(iter, &key, &off_bitmap))
+ {
+ BlockNumber blkno = key_get_blkno(iter->ts, key);
+ Assert(BlockNumberIsValid(blkno));
+
+ if (BlockNumberIsValid(output->blkno) && output->blkno != blkno)
+ {
+ /*
+ * We got tids for a different block. We return the collected
+ * tids so far, and remember the key-value for the next
+ * iteration.
+ */
+ iter->next_tidkey = key;
+ iter->next_off_bitmap = off_bitmap;
+ return output;
+ }
+
+ /* Collect tids decoded from the key and offset bitmap */
+ iter_decode_key_off(iter, key, off_bitmap);
+ }
+
+ iter->finished = true;
+ return output;
+}
+
+/*
+ * Finish an iteration over a TidStore. This needs to be called after finishing
+ * an iteration or when exiting one early.
+ */
+void
+TidStoreEndIterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+TidStoreNumTids(TidStore *ts)
+{
+ int64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ Assert(num_tids >= 0);
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+TidStoreIsFull(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (TidStoreMemoryUsage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+TidStoreMaxMemory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+TidStoreMemoryUsage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+TidStoreHandle
+TidStoreGetHandle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/*
+ * Decode the key and offset bitmap into tids and store them in the iteration
+ * result.
+ */
+static void
+iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap)
+{
+ TidStoreIterResult *output = (&iter->output);
+
+ while (off_bitmap)
+ {
+ uint64 compressed_tid;
+ OffsetNumber off;
+
+ compressed_tid = key << LOWER_OFFSET_NBITS;
+ compressed_tid |= pg_rightmost_one_pos64(off_bitmap);
+
+ off = compressed_tid & ((UINT64CONST(1) << iter->ts->control->max_off_nbits) - 1);
+
+ Assert(output->num_offsets < iter->ts->control->max_off);
+ output->offsets[output->num_offsets++] = off;
+
+ /* unset the rightmost bit */
+ off_bitmap &= ~pg_rightmost_one64(off_bitmap);
+ }
+
+ output->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, tidkey key)
+{
+ return (BlockNumber) (key >> ts->control->upper_off_nbits);
+}
+
+/* Encode a tid to key and partial offset */
+static inline tidkey
+encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit)
+{
+ OffsetNumber offset = ItemPointerGetOffsetNumber(tid);
+ BlockNumber block = ItemPointerGetBlockNumber(tid);
+
+ return encode_blk_off(ts, block, offset, off_bit);
+}
+
+/* Encode a block and offset into a key and partial offset */
+static inline tidkey
+encode_blk_off(TidStore *ts, BlockNumber block, OffsetNumber offset,
+ offsetbm *off_bit)
+{
+ tidkey key;
+ uint64 compressed_tid;
+ uint32 off_lower;
+
+ off_lower = offset & LOWER_OFFSET_MASK;
+ Assert(off_lower < (sizeof(offsetbm) * BITS_PER_BYTE));
+
+ *off_bit = UINT64CONST(1) << off_lower;
+ compressed_tid = offset | ((uint64) block << ts->control->max_off_nbits);
+ key = compressed_tid >> LOWER_OFFSET_NBITS;
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..d1cc93cbb6
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,50 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer TidStoreHandle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+/* Result struct for TidStoreIterateNext */
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ int num_offsets;
+ OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
+} TidStoreIterResult;
+
+extern TidStore *TidStoreCreate(size_t max_bytes, OffsetNumber max_off, dsa_area *dsa);
+extern TidStore *TidStoreAttach(dsa_area *dsa, dsa_pointer handle);
+extern void TidStoreDetach(TidStore *ts);
+extern void TidStoreDestroy(TidStore *ts);
+extern void TidStoreReset(TidStore *ts);
+extern void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool TidStoreIsMember(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * TidStoreBeginIterate(TidStore *ts);
+extern TidStoreIterResult *TidStoreIterateNext(TidStoreIter *iter);
+extern void TidStoreEndIterate(TidStoreIter *iter);
+extern int64 TidStoreNumTids(TidStore *ts);
+extern bool TidStoreIsFull(TidStore *ts);
+extern size_t TidStoreMaxMemory(TidStore *ts);
+extern size_t TidStoreMemoryUsage(TidStore *ts);
+extern TidStoreHandle TidStoreGetHandle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..8659e6780e
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,228 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+/* #define TEST_SHARED_TIDSTORE 1 */
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = TidStoreIsMember(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "TidStoreIsMember for TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+#else
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+#endif
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ TidStoreSetBlockOffsets(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (TidStoreNumTids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "TidStoreNumTids returned " UINT64_FORMAT ", expected %d",
+ TidStoreNumTids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = TidStoreBeginIterate(ts);
+ blk_idx = 0;
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "TidStoreIterateNext returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "TidStoreIterateNext returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "TidStoreIterateNext returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "TidStoreIterateNext returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ TidStoreReset(ts);
+
+ /* test the number of tids */
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ TidStoreDestroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+#else
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+#endif
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
+ 0, FirstOffsetNumber);
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
+
+ if (TidStoreIsFull(ts))
+ elog(ERROR, "TidStoreIsFull on empty store returned true");
+
+ iter = TidStoreBeginIterate(ts);
+
+ if (TidStoreIterateNext(iter) != NULL)
+ elog(ERROR, "TidStoreIterateNext on empty store returned TIDs");
+
+ TidStoreEndIterate(iter);
+
+ TidStoreDestroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+ test_basic(MaxHeapTuplesPerPage * 2);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
Attachment: v30-0005-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch
From 1feaf4249814a4bb7c5683649130b16cf3e5c754 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v30 05/11] Tool for measuring radix tree and tidstore
performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 88 +++
contrib/bench_radix_tree/bench_radix_tree.c | 747 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 925 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..ad66265e23
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,88 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_tidstore_load(
+minblk int4,
+maxblk int4,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT iter_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..63e842395d
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,747 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+//#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+PG_FUNCTION_INFO_V1(bench_tidstore_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates shuffle implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+Datum
+bench_tidstore_load(PG_FUNCTION_ARGS)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
+ OffsetNumber *offs;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_ms;
+ int64 iter_ms;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3] = {false};
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ offs = palloc(sizeof(OffsetNumber) * TIDS_PER_BLOCK_FOR_LOAD);
+ for (int i = 0; i < TIDS_PER_BLOCK_FOR_LOAD; i++)
+ offs[i] = i + 1; /* FirstOffsetNumber is 1 */
+
+ ts = TidStoreCreate(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
+
+ /* load tids */
+ start_time = GetCurrentTimestamp();
+ for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
+ TidStoreSetBlockOffsets(ts, blkno, offs, TIDS_PER_BLOCK_FOR_LOAD);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* iterate through tids */
+ iter = TidStoreBeginIterate(ts);
+ start_time = GetCurrentTimestamp();
+ while ((result = TidStoreIterateNext(iter)) != NULL)
+ ;
+ TidStoreEndIterate(iter);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ iter_ms = secs * 1000 + usecs / 1000;
+
+ values[0] = Int64GetDatum(TidStoreMemoryUsage(ts));
+ values[1] = Int64GetDatum(load_ms);
+ values[2] = Int64GetDatum(iter_ms);
+
+ TidStoreDestroy(ts);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ int64 search_time_ms;
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* to silence warnings about unused iter functions */
+static void pg_attribute_unused()
+stub_iter()
+{
+ rt_radix_tree *rt;
+ rt_iter *iter;
+ uint64 key = 1;
+ uint64 value = 1;
+
+ rt = rt_create(CurrentMemoryContext);
+
+ iter = rt_begin_iterate(rt);
+ rt_iterate_next(iter, &key, &value);
+ rt_end_iterate(iter);
+}
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..421d469f8c 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
Attachment: v30-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch (application/octet-stream)
From 756f0a7a1f3e9030ddc68ae635baa25c4a310b4d Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v30 02/11] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 36 ++------------------------------
src/include/nodes/bitmapset.h | 16 ++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 47 insertions(+), 37 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 98660524ad..fcd8e2ccbc 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
/*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
{
int result;
- w = RIGHTMOST_ONE(w);
+ w = bmw_rightmost_one(w);
a->words[wordnum] &= ~w;
result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 3d2225e1ae..5f9a511b4a 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -75,6 +73,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 158ef73a2b..bf7588e075 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -32,6 +32,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
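+/*
+ * A concrete illustration (example values added for clarity, not in the
+ * original bitmapset.c comment): for word = 12 (binary 1100), -word is
+ * ...11110100 in two's complement, so word & -word gives binary 0100,
+ * i.e. 4, the isolated rightmost one-bit.
+ */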
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 86a9303bf5..4a5e776703 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3675,7 +3675,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
Attachment: v30-0003-Add-a-macro-templatized-radix-tree.patch (application/octet-stream)
From 87b21d222bc9e2b8bdbd6cb7c880d1f5a5192242 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v30 03/11] Add a macro templatized radix tree.
The radix tree data structure is implemented based on the idea from
the paper "The Adaptive Radix Tree: ARTful Indexing for Main-Memory
Databases" by Viktor Leis, Alfons Kemper, and Thomas Neumann,
2013. Some optimizations proposed in the ART paper are not yet
implemented, particularly path compression and lazy path expansion.
For better performance, the radix tree has to be adjusted to the
individual use case at compile time. So the radix
tree is implemented using a macro-templatized header file, which
generates functions and types based on a prefix and other parameters.
The key of the radix tree is a 64-bit unsigned integer, but the caller
can specify the type of the value. Our main innovation compared to the
ART paper is decoupling the notion of size class from kind. The size
classes within
a given node kind have the same underlying type, but a variable number
of children/values. Nodes of different kinds necessarily belong to
different size classes. Growing from one node kind to another requires
special code for each case, but growing from one size class to another
within the same kind is basically just allocate + memcpy.
The radix tree can also be created in a DSA area. To handle
concurrency, we use a single reader-writer lock for the radix
tree. The current locking mechanism is not optimized for high
concurrency with mixed read-write workloads. In the future it might be
worthwhile to replace it with the Optimistic Lock Coupling or ROWEX
mentioned in the paper "The ART of Practical Synchronization" by the
same authors as the ART paper, 2016.
Later patches use this infrastructure to store dead tuple TIDs during
lazy vacuum. There are other possible cases where this could be useful
(e.g., as a replacement for the hash table used for shared buffers).
This includes a unit test module, in src/test/modules/test_radixtree.
Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA@mail.gmail.com
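To make the template usage concrete, here is a minimal sketch (illustration
only, not part of the patch) of how a single-process caller could instantiate
and use the tree. The prefix "rt", the static scope, and the uint64 value type
are arbitrary choices for the example, mirroring what the benchmark module
above does:

    /* instantiate a local-memory radix tree with uint64 values */
    #define RT_PREFIX rt
    #define RT_SCOPE static
    #define RT_DECLARE
    #define RT_DEFINE
    #define RT_VALUE_TYPE uint64
    #include "lib/radixtree.h"

    /* ... later, in a function: */
    rt_radix_tree *tree = rt_create(CurrentMemoryContext);
    uint64 key = 42;
    uint64 val = 1;

    rt_set(tree, key, &val);            /* insert or update the pair */
    if (rt_search(tree, key, &val))     /* existence check; fetches the value */
        elog(NOTICE, "found value " UINT64_FORMAT, val);
    rt_free(tree);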
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2523 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 328 +++
src/include/lib/radixtree_iter_impl.h | 144 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 38 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 712 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4120 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..2e3963c3d5
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2523 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves"
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-values leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case, several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND_UP RT_MAKE_NAME(extend_up)
+#define RT_EXTEND_DOWN RT_MAKE_NAME(extend_down)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_ITER_SET_NODE_FROM RT_MAKE_NAME(iter_set_node_from)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in the node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Maximum number of levels the radix tree can have */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256 is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the node span of 8 bits.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store pointers
+ * to their child nodes in the slots. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base types of each node kind for leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different from
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key.
+ *
+ * RT_NODE_ITER is the struct for iteration of one radix tree node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * for each level to track the iteration within the node.
+ */
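+/*
+ * Illustrative sketch, not part of this patch: a typical caller-side
+ * iteration loop over a tree instantiated with RT_VALUE_TYPE = uint64,
+ * where do_something() is a placeholder. Pairs are returned in ascending
+ * key order:
+ *
+ *     iter = RT_BEGIN_ITERATE(tree);
+ *     while (RT_ITERATE_NEXT(iter, &key, &value))
+ *         do_something(key, value);
+ *     RT_END_ITERATE(iter);
+ */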
+typedef struct RT_NODE_ITER
+{
+ /*
+ * Local pointer to the node we are iterating over.
+ *
+ * Since the radix tree doesn't support shared iteration among multiple
+ * processes, we use RT_PTR_LOCAL rather than RT_PTR_ALLOC.
+ */
+ RT_PTR_LOCAL node;
+
+ /*
+ * The next index of the chunk array in RT_NODE_KIND_3 and
+ * RT_NODE_KIND_32 nodes, or the next chunk in RT_NODE_KIND_125 and
+ * RT_NODE_KIND_256 nodes. 0 for the initial value.
+ */
+ int idx;
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the nodes for each level. level = 0 is for a leaf node */
+ RT_NODE_ITER node_iters[RT_MAX_LEVEL];
+ int top_level;
+
+ /* The key constructed during the iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child pointer for the given chunk in the inner node-256 */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that allows storing the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
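+
+/*
+ * As an illustration (assuming RT_NODE_SPAN is 8, i.e. one key byte per
+ * level): for key = 0x123456 the leftmost set bit is at position 20, so
+ * RT_KEY_GET_SHIFT returns (20 / 8) * 8 = 16, and RT_SHIFT_GET_MAX_VAL(16)
+ * reports that a tree whose root has shift 16 can store keys up to
+ * (1 << 24) - 1 = 0xFFFFFF.
+ */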
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static pg_noinline void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and the old node it replaces, initialize the
+ * new node's common fields from the old one and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static inline void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static pg_noinline void
+RT_EXTEND_UP(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have the inner and leaf nodes needed for the
+ * given key. Create them from 'node' down to the bottom and store the value
+ * in the new leaf.
+ */
+static pg_noinline void
+RT_EXTEND_DOWN(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
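+
+/*
+ * Sketch of how the two extension paths cooperate (assuming RT_NODE_SPAN is
+ * 8): inserting key 0x10000 into a tree whose root has shift 8 first calls
+ * RT_EXTEND_UP, which stacks a new node-3 with shift 16 on top of the old
+ * root (stored as chunk 0). The following descent then fails to find a child
+ * for chunk 0x01 at shift 16, so RT_EXTEND_DOWN creates the missing inner
+ * node at shift 8 and the leaf at shift 0, and stores the value in that leaf.
+ */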
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is set to *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is copied into *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node, and
+ * delete it from the node.
+ *
+ * Return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node, and delete
+ * it from the node.
+ *
+ * Return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static void
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
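+
+/*
+ * Usage sketch (illustrative only; RT_CREATE stands for the prefixed name
+ * generated from RT_PREFIX): a local tree lives in a caller-supplied memory
+ * context, while a shared tree additionally needs a DSA area and an LWLock
+ * tranche id:
+ *
+ * tree = RT_CREATE(ctx); (without RT_SHMEM)
+ * tree = RT_CREATE(ctx, dsa, tranche_id); (with RT_SHMEM)
+ */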
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
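+
+/*
+ * Sketch of the shared-memory workflow (illustrative only): the backend that
+ * created the tree passes RT_GET_HANDLE(tree) to other processes (e.g. via a
+ * shm_toc), each of which calls RT_ATTACH(dsa, handle) on the same DSA area
+ * to get its own backend-local RT_RADIX_TREE wrapper, and RT_DETACH() when
+ * it is done with it. Only RT_FREE() actually releases the DSA memory.
+ */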
+
+/*
+ * Recursively free all nodes allocated to the DSA area.
+ */
+static void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set the value for the given key. If the entry already exists, update its
+ * value and return true; otherwise insert a new entry and return false.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND_UP(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_EXTEND_DOWN(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is
+ * found, otherwise return false. On success, the value is copied into
+ * *value_p, so value_p must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* the key was not found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys and we don't need to delete the
+ * node.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Unwind the stack, deleting the child pointer from each inner node */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop freeing nodes */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
+/*
+ * Scan the inner node and return the next child node if one exists,
+ * otherwise return NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Scan the leaf node; if there is a next value, set it to *value_p and
+ * return true. Otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Descend the radix tree from the 'from' node to the bottom, setting the
+ * next node to iterate for each level along the way.
+ */
+static void
+RT_ITER_SET_NODE_FROM(RT_ITER *iter, RT_PTR_LOCAL from)
+{
+ int level = from->shift / RT_NODE_SPAN;
+ RT_PTR_LOCAL node = from;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->node_iters[level--]);
+
+#ifdef USE_ASSERT_CHECKING
+ if (node_iter->node)
+ {
+ /* We must have finished the iteration on the previous node */
+ if (RT_NODE_IS_LEAF(node_iter->node))
+ {
+ RT_VALUE_TYPE dummy;
+ Assert(!RT_NODE_LEAF_ITERATE_NEXT(iter, node_iter, &dummy));
+ }
+ else
+ Assert(!RT_NODE_INNER_ITERATE_NEXT(iter, node_iter));
+ }
+#endif
+
+ /* Set the node to the node iterator of this level */
+ node_iter->node = node;
+ node_iter->idx = 0;
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ /* We will visit the leaf node when RT_ITERATE_NEXT() is called */
+ break;
+ }
+
+ /*
+ * Get the first child node from the node, which corresponds to the
+ * lowest chunk within the node.
+ */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* The first child must be found */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->context,
+ sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+ /* empty tree */
+ if (!RT_PTR_ALLOC_IS_VALID(iter->tree->ctl->root))
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ iter->top_level = root->shift / RT_NODE_SPAN;
+
+ /*
+ * Set the next node to iterate for each level from the level of the
+ * root node.
+ */
+ RT_ITER_SET_NODE_FROM(iter, root);
+
+ return iter;
+}
+
+/*
+ * If there is a next key, return true and set *key_p and *value_p.
+ * Otherwise return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ Assert(value_p != NULL);
+
+ /* Empty tree */
+ if (!RT_PTR_ALLOC_IS_VALID(iter->tree->ctl->root))
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+
+ /* Get the next chunk of the leaf node */
+ if (RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->node_iters[0]), value_p))
+ {
+ *key_p = iter->key;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance the inner-node
+ * iterators, starting from level 1, until we find an inner node that
+ * still has a child to visit.
+ */
+ for (int level = 1; level <= iter->top_level; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->node_iters[level]));
+
+ if (child)
+ break;
+ }
+
+ /* We've visited all nodes, so the iteration is finished */
+ if (!child)
+ break;
+
+ /*
+ * Found a new child node. Set the next node to iterate for each level
+ * from the level of this child node downwards.
+ */
+ RT_ITER_SET_NODE_FROM(iter, child);
+
+ /* Loop around to fetch key-value pairs from the new leaf node */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function must be called when the iteration is finished, or when
+ * bailing out of the iteration early.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
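+
+/*
+ * Iteration usage sketch (illustrative only):
+ *
+ * RT_ITER *iter = RT_BEGIN_ITERATE(tree);
+ * uint64 key;
+ * RT_VALUE_TYPE value;
+ *
+ * while (RT_ITERATE_NEXT(iter, &key, &value))
+ * ... keys are visited in ascending order ...
+ * RT_END_ITERATE(iter);
+ *
+ * In the RT_SHMEM case the shared lock taken by RT_BEGIN_ITERATE is held
+ * until RT_END_ITERATE, so concurrent RT_SET/RT_DELETE calls block until
+ * the iteration finishes.
+ */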
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_NODE_MAX_SLOTS / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+ /* We reached a leaf node, find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s",buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND_UP
+#undef RT_EXTEND_DOWN
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_ITER_SET_NODE_FROM
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..d56e58dcac
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,328 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ bool chunk_exists = false;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n3->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
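+ /*
+ * For example, assuming the largest node-32 size class has a fanout of
+ * 32 and bitmapwords are 64 bits wide, the store above sets isset[0] to
+ * 0xFFFFFFFF, i.e. slots 0..31 are marked used.
+ */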
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos;
+ int cnt = 0;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n125->values[slotpos] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!chunk_exists)
+ node->count++;
+#else
+ node->count++;
+#endif
+
+ /*
+ * Done. Finally, verify the chunk and value is inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return chunk_exists;
+#else
+ return;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..5c1034768e
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,144 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 key_chunk = 0;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ if (node_iter->idx >= n3->base.n.count)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[node_iter->idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->idx];
+ node_iter->idx++;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ if (node_iter->idx >= n32->base.n.count)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[node_iter->idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->idx];
+ node_iter->idx++;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int chunk;
+
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
+ break;
+ }
+
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, chunk));
+#endif
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int chunk;
+
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ break;
+ }
+
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, chunk));
+#endif
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
+ break;
+ }
+ }
+
+ /* Update the part of the key that corresponds to this node's level */
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << node_iter->node->shift);
+ iter->key |= (((uint64) key_chunk) << node_iter->node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return true;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/include/lib/radixtree.h"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the stats flag in test_radixtree.c, the
+tests will print extra information about execution time and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..7ad1ce3605
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,38 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 3
+NOTICE: testing basic operations with inner node 3
+NOTICE: testing basic operations with leaf node 15
+NOTICE: testing basic operations with inner node 15
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
'--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..5a169854d9
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,712 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+/*
+ * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
+ */
+static int rt_node_class_fanouts[] = {
+ 3, /* RT_CLASS_3 */
+ 15, /* RT_CLASS_32_MIN */
+ 32, /* RT_CLASS_32_MAX */
+ 125, /* RT_CLASS_125 */
+ 256 /* RT_CLASS_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+/* #define RT_SHMEM */
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in an interleaved order like 1, children, 2, children - 1, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end)
+{
+ for (int i = start; i <= end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+/*
+ * Insert 256 key-value pairs, and check if keys are properly inserted on each
+ * node class.
+ */
+/* Test keys [0, 256) */
+#define NODE_TYPE_TEST_KEY_MIN 0
+#define NODE_TYPE_TEST_KEY_MAX 256
+static void
+test_node_types_insert_asc(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = 0;
+
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType *) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
+ {
+ check_search_on_node(radixtree, shift, key_checked, i);
+ key_checked = i;
+ node_class_idx++;
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Similar to test_node_types_insert_asc(), but inserts keys in descending order.
+ */
+static void
+test_node_types_insert_desc(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = NODE_TYPE_TEST_KEY_MAX - 1;
+
+ for (int i = NODE_TYPE_TEST_KEY_MAX - 1; i >= NODE_TYPE_TEST_KEY_MIN; i--)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType *) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
+ {
+ check_search_on_node(radixtree, shift, i, key_checked);
+ key_checked = i;
+ node_class_idx++;
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert_asc(radixtree, shift);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert_desc(radixtree, shift);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 0; i < lengthof(rt_node_class_fanouts); i++)
+ {
+ test_basic(rt_node_class_fanouts[i], false);
+ test_basic(rt_node_class_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index 2c5042eb41..14b37e8eef 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index abbba7aa63..d4d2f1da03 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.31.1
Attachment: v30-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
From 22b578551e15e829e6649784eac8ec66d4a455c3 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v30 01/11] Introduce helper SIMD functions for small byte
arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 1fa6c3bc6c..dfae14e463 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+ * Note: There is a faster way to do this, but it returns a uint64,
+ * and if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
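
[Editor's note] To make the intended use of these two helpers concrete, here is a
minimal sketch (not part of the patch) of the node-search idiom they enable. It
assumes the pre-existing vector8_broadcast, vector8_load, and vector8_eq
primitives in port/simd.h and pg_rightmost_one_pos32 from port/pg_bitutils.h;
the function and variable names are made up for illustration, and the code only
applies when SIMD support is compiled in (i.e. not USE_NO_SIMD):

#include "port/pg_bitutils.h"
#include "port/simd.h"

/*
 * Hypothetical example: return the index of the first element in the sorted
 * byte array 'chunks' (at least sizeof(Vector8) bytes long) that is >= 'key',
 * or -1 if there is none.
 */
static inline int
chunk_search_ge(const uint8 *chunks, uint8 key)
{
	Vector8		spread_chunk = vector8_broadcast(key);
	Vector8		haystack;
	Vector8		min;
	Vector8		cmp;
	uint32		bitfield;

	vector8_load(&haystack, chunks);

	/* min(key, chunks[i]) == key exactly where chunks[i] >= key */
	min = vector8_min(spread_chunk, haystack);
	cmp = vector8_eq(spread_chunk, min);

	/* collapse the byte-wise comparison result into one bit per element */
	bitfield = vector8_highbit_mask(cmp);

	return (bitfield != 0) ? pg_rightmost_one_pos32(bitfield) : -1;
}

Because the movemask-style operation maps byte i to bit i, the lowest set bit of
the returned mask identifies the first qualifying array position.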
On Thu, Mar 9, 2023 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> I've attached the new version patches. I merged improvements and fixes
> I did in the v29 patch.

I haven't yet had a chance to look at those closely, since I've had to
devote time to other commitments. I remember I wasn't particularly
impressed that v29-0008 mixed my requested name-casing changes with a
bunch of other random things. Separating those out would be an obvious
way to make it easier for me to look at, whenever I can get back to this.
I need to look at the iteration changes as well, in addition to testing
memory measurement (thanks for the new results, they look encouraging).

> Apart from the memory measurement stuff, I've done another todo item
> on my list; adding min max classes for node3 and node125. I've done

This didn't help move us closer to something committable the first time
you coded this without making sure it was a good idea. It's still not
helping and arguably makes it worse. To be fair, I did speak positively
about _considering_ additional size classes some months ago, but that has
a very obvious maintenance cost, something we can least afford right now.

I'm frankly baffled you thought this was important enough to work on
again, yet thought it was a waste of time to try to prove to ourselves
that autovacuum in a realistic, non-deterministic workload gave the same
answer as the current tid lookup. Even if we had gone that far, it
doesn't seem like a good idea to add non-essential code to critical paths
right now.

We're rapidly running out of time, and we're at the point in the cycle
where it's impossible to get meaningful review from anyone not already
intimately familiar with the patch series. I only want to see progress on
addressing possible (especially architectural) objections from the
community, because if they don't notice them now, they surely will after
commit. I have my own list of possible objections as well as bikeshedding
points, which I'll clean up and share next week. I plan to invite Andres
to look at that list and give his impressions, because it's a lot quicker
than reading the patches. Based on that, I'll hopefully be able to decide
whether we have enough time to address any feedback and do remaining
polishing in time for feature freeze.

I'd suggest sharing your todo list in the meanwhile; it'd be good to
discuss what's worth doing and what is not.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Mar 10, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:

> On Thu, Mar 9, 2023 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > I've attached the new version patches. I merged improvements and fixes
> > I did in the v29 patch.
>
> I haven't yet had a chance to look at those closely, since I've had to
> devote time to other commitments. I remember I wasn't particularly
> impressed that v29-0008 mixed my requested name-casing changes with a
> bunch of other random things. Separating those out would be an obvious
> way to make it easier for me to look at, whenever I can get back to
> this. I need to look at the iteration changes as well, in addition to
> testing memory measurement (thanks for the new results, they look
> encouraging).

Okay, I'll separate them again.

> > Apart from the memory measurement stuff, I've done another todo item
> > on my list; adding min max classes for node3 and node125. I've done
>
> This didn't help move us closer to something committable the first time
> you coded this without making sure it was a good idea. It's still not
> helping and arguably makes it worse. To be fair, I did speak positively
> about _considering_ additional size classes some months ago, but that
> has a very obvious maintenance cost, something we can least afford
> right now.
>
> I'm frankly baffled you thought this was important enough to work on
> again, yet thought it was a waste of time to try to prove to ourselves
> that autovacuum in a realistic, non-deterministic workload gave the
> same answer as the current tid lookup. Even if we had gone that far, it
> doesn't seem like a good idea to add non-essential code to critical
> paths right now.

I didn't think that proving tidstore and the current tid lookup return
the same result was a waste of time. I've shared a patch to do that in
tidstore before. I agreed not to add it to the tree, but we can test
that using this patch. Actually, I've done a test that ran a pgbench
workload for a few days.

IIUC it's still important to consider whether to have node1, since it
could be a good alternative to path compression. The prototype also
implemented it. Of course we can leave it for future improvement, but
considering this item together with the performance tests helps us prove
that our decoupling approach is promising.

> We're rapidly running out of time, and we're at the point in the cycle
> where it's impossible to get meaningful review from anyone not already
> intimately familiar with the patch series. I only want to see progress
> on addressing possible (especially architectural) objections from the
> community, because if they don't notice them now, they surely will
> after commit.

Right, we've been making many design decisions. Some of them are agreed
just between you and me, and some are agreed with other hackers. Some
design decisions are effectively irreversible given the remaining time.

> I have my own list of possible objections as well as bikeshedding
> points, which I'll clean up and share next week.

Thanks.

> I plan to invite Andres to look at that list and give his impressions,
> because it's a lot quicker than reading the patches. Based on that,
> I'll hopefully be able to decide whether we have enough time to address
> any feedback and do remaining polishing in time for feature freeze.
>
> I'd suggest sharing your todo list in the meanwhile; it'd be good to
> discuss what's worth doing and what is not.

Apart from more rounds of reviews and tests, my todo items that need
discussion and possibly implementation are:

* The memory measurement in radix trees and the memory limit in
tidstores. I've implemented it in v30-0007 through 0009 but we need to
review it. This is the highest priority for me.

* Additional size classes. They are important as an alternative to path
compression, as well as for supporting our decoupling approach. Medium
priority.

* Node shrinking support. Low priority.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Mar 10, 2023 at 11:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

> On Fri, Mar 10, 2023 at 3:42 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> > On Thu, Mar 9, 2023 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > I've attached the new version patches. I merged improvements and fixes
> > > I did in the v29 patch.
> >
> > I haven't yet had a chance to look at those closely, since I've had to
> > devote time to other commitments. I remember I wasn't particularly
> > impressed that v29-0008 mixed my requested name-casing changes with a
> > bunch of other random things. Separating those out would be an obvious
> > way to make it easier for me to look at, whenever I can get back to
> > this. I need to look at the iteration changes as well, in addition to
> > testing memory measurement (thanks for the new results, they look
> > encouraging).
>
> Okay, I'll separate them again.

Attached a new patch series. In addition to separating them again, I've
fixed a conflict with HEAD.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v31-0013-Add-min-and-max-classes-for-node3-and-node125.patch
From 1b43002d25137699d0e13158d821a8550e757348 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 9 Mar 2023 11:42:17 +0900
Subject: [PATCH v31 13/14] Add min and max classes for node3 and node125.
---
src/include/lib/radixtree.h | 70 +++++++++++++------
src/include/lib/radixtree_insert_impl.h | 56 ++++++++++++++-
.../expected/test_radixtree.out | 4 ++
.../modules/test_radixtree/test_radixtree.c | 6 +-
4 files changed, 110 insertions(+), 26 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index f7812eb12a..1759c909b6 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -225,10 +225,12 @@
#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
-#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_3_MIN RT_MAKE_NAME(class_3_min)
+#define RT_CLASS_3_MAX RT_MAKE_NAME(class_3_max)
#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
-#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_125_MIN RT_MAKE_NAME(class_125_min)
+#define RT_CLASS_125_MAX RT_MAKE_NAME(class_125_max)
#define RT_CLASS_256 RT_MAKE_NAME(class_256)
/* generate forward declarations necessary to use the radix tree */
@@ -561,10 +563,12 @@ typedef struct RT_NODE_LEAF_256
*/
typedef enum RT_SIZE_CLASS
{
- RT_CLASS_3 = 0,
+ RT_CLASS_3_MIN = 0,
+ RT_CLASS_3_MAX,
RT_CLASS_32_MIN,
RT_CLASS_32_MAX,
- RT_CLASS_125,
+ RT_CLASS_125_MIN,
+ RT_CLASS_125_MAX,
RT_CLASS_256
} RT_SIZE_CLASS;
@@ -580,7 +584,13 @@ typedef struct RT_SIZE_CLASS_ELEM
} RT_SIZE_CLASS_ELEM;
static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
- [RT_CLASS_3] = {
+ [RT_CLASS_3_MIN] = {
+ .name = "radix tree node 1",
+ .fanout = 1,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 1 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 1 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_3_MAX] = {
.name = "radix tree node 3",
.fanout = 3,
.inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
@@ -598,7 +608,13 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
.inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
.leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
},
- [RT_CLASS_125] = {
+ [RT_CLASS_125_MIN] = {
+ .name = "radix tree node 125",
+ .fanout = 61,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 61 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 61 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125_MAX] = {
.name = "radix tree node 125",
.fanout = 125,
.inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
@@ -934,7 +950,7 @@ static inline void
RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
@@ -946,7 +962,7 @@ static inline void
RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
{
- const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].fanout;
const Size chunk_size = sizeof(uint8) * fanout;
const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
@@ -1152,9 +1168,9 @@ RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_ALLOC allocnode;
RT_PTR_LOCAL newnode;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_MIN, is_leaf);
newnode = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3_MIN, is_leaf);
newnode->shift = shift;
tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
tree->ctl->root = allocnode;
@@ -1188,17 +1204,21 @@ static inline Size
RT_FANOUT_GET_NODE_SIZE(int fanout, bool is_leaf)
{
const Size fanout_inner_node_size[] = {
- [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].inner_size,
+ [1] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MIN].inner_size,
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].inner_size,
[15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].inner_size,
[32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].inner_size,
- [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].inner_size,
+ [61] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MIN].inner_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MAX].inner_size,
[256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].inner_size,
};
const Size fanout_leaf_node_size[] = {
- [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].leaf_size,
+ [1] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MIN].leaf_size,
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX].leaf_size,
[15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].leaf_size,
[32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].leaf_size,
- [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].leaf_size,
+ [61] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MIN].leaf_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125_MAX].leaf_size,
[256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].leaf_size,
};
Size node_size;
@@ -1337,9 +1357,9 @@ RT_EXTEND_UP(RT_RADIX_TREE *tree, uint64 key)
RT_PTR_LOCAL node;
RT_NODE_INNER_3 *n3;
- allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, true);
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3_MIN, true);
node = RT_PTR_GET_LOCAL(tree, allocnode);
- RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, true);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3_MIN, true);
node->shift = shift;
node->count = 1;
@@ -1375,9 +1395,9 @@ RT_EXTEND_DOWN(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_L
int newshift = shift - RT_NODE_SPAN;
bool is_leaf = newshift == 0;
- allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3_MIN, is_leaf);
newchild = RT_PTR_GET_LOCAL(tree, allocchild);
- RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3_MIN, is_leaf);
newchild->shift = newshift;
RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
@@ -2177,12 +2197,14 @@ RT_STATS(RT_RADIX_TREE *tree)
{
RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
- fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ fprintf(stderr, "height = %d, n1 = %u, n3 = %u, n15 = %u, n32 = %u, n61 = %u, n125 = %u, n256 = %u\n",
root->shift / RT_NODE_SPAN,
- tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_3_MIN],
+ tree->ctl->cnt[RT_CLASS_3_MAX],
tree->ctl->cnt[RT_CLASS_32_MIN],
tree->ctl->cnt[RT_CLASS_32_MAX],
- tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_125_MIN],
+ tree->ctl->cnt[RT_CLASS_125_MAX],
tree->ctl->cnt[RT_CLASS_256]);
}
@@ -2519,10 +2541,12 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_SIZE_CLASS
#undef RT_SIZE_CLASS_ELEM
#undef RT_SIZE_CLASS_INFO
-#undef RT_CLASS_3
+#undef RT_CLASS_3_MIN
+#undef RT_CLASS_3_MAX
#undef RT_CLASS_32_MIN
#undef RT_CLASS_32_MAX
-#undef RT_CLASS_125
+#undef RT_CLASS_125_MIN
+#undef RT_CLASS_125_MAX
#undef RT_CLASS_256
/* function declarations */
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
index d56e58dcac..d10093dfba 100644
--- a/src/include/lib/radixtree_insert_impl.h
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -42,6 +42,7 @@
{
case RT_NODE_KIND_3:
{
+ const RT_SIZE_CLASS_ELEM class3_max = RT_SIZE_CLASS_INFO[RT_CLASS_3_MAX];
RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
#ifdef RT_NODE_LEVEL_LEAF
@@ -55,6 +56,32 @@
break;
}
#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)) &&
+ n3->base.n.fanout < class3_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class3_min = RT_SIZE_CLASS_INFO[RT_CLASS_3_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_3_MAX;
+
+ Assert(n3->base.n.fanout == class3_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n3 = (RT_NODE3_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class3_min.leaf_size);
+#else
+ memcpy(newnode, node, class3_min.inner_size);
+#endif
+ newnode->fanout = class3_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
if (unlikely(RT_NODE_MUST_GROW(n3)))
{
RT_PTR_ALLOC allocnode;
@@ -154,7 +181,7 @@
RT_PTR_LOCAL newnode;
RT_NODE125_TYPE *new125;
const uint8 new_kind = RT_NODE_KIND_125;
- const RT_SIZE_CLASS new_class = RT_CLASS_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125_MIN;
Assert(n32->base.n.fanout == class32_max.fanout);
@@ -213,6 +240,7 @@
/* FALLTHROUGH */
case RT_NODE_KIND_125:
{
+ const RT_SIZE_CLASS_ELEM class125_max = RT_SIZE_CLASS_INFO[RT_CLASS_125_MAX];
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
int slotpos;
int cnt = 0;
@@ -227,6 +255,32 @@
break;
}
#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)) &&
+ n125->base.n.fanout < class125_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class125_min = RT_SIZE_CLASS_INFO[RT_CLASS_125_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_125_MAX;
+
+ Assert(n125->base.n.fanout == class125_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n125 = (RT_NODE125_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class125_min.leaf_size);
+#else
+ memcpy(newnode, node, class125_min.inner_size);
+#endif
+ newnode->fanout = class125_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
if (unlikely(RT_NODE_MUST_GROW(n125)))
{
RT_PTR_ALLOC allocnode;
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index 7ad1ce3605..f2b1d7e4f8 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -4,12 +4,16 @@ CREATE EXTENSION test_radixtree;
-- an error if something fails.
--
SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 1
+NOTICE: testing basic operations with inner node 1
NOTICE: testing basic operations with leaf node 3
NOTICE: testing basic operations with inner node 3
NOTICE: testing basic operations with leaf node 15
NOTICE: testing basic operations with inner node 15
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 61
+NOTICE: testing basic operations with inner node 61
NOTICE: testing basic operations with leaf node 125
NOTICE: testing basic operations with inner node 125
NOTICE: testing basic operations with leaf node 256
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 19d286d84b..4f38b6e3de 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -47,10 +47,12 @@ static const bool rt_test_stats = false;
* XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
*/
static int rt_node_class_fanouts[] = {
- 3, /* RT_CLASS_3 */
+ 1, /* RT_CLASS_3_MIN */
+ 3, /* RT_CLASS_3_MAX */
15, /* RT_CLASS_32_MIN */
32, /* RT_CLASS_32_MAX */
- 125, /* RT_CLASS_125 */
+ 61, /* RT_CLASS_125_MIN */
+ 125, /* RT_CLASS_125_MAX */
256 /* RT_CLASS_256 */
};
/*
--
2.31.1
v31-0011-Remove-the-max-memory-deduction-from-TidStore.patch
From e86e43b93fb901aacd8d2b69aa53ad896c5b5e1c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Mar 2023 15:08:58 +0900
Subject: [PATCH v31 11/14] Remove the max memory deduction from TidStore.
---
src/backend/access/common/tidstore.c | 43 +++++++---------------------
1 file changed, 10 insertions(+), 33 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 9360520482..ee73759648 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -82,6 +82,7 @@ typedef uint64 offsetbm;
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
@@ -90,6 +91,7 @@ typedef uint64 offsetbm;
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
@@ -180,39 +182,15 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
ts = palloc0(sizeof(TidStore));
- /*
- * Create the radix tree for the main storage.
- *
- * Memory consumption depends on the number of stored tids, but also on the
- * distribution of them, how the radix tree stores, and the memory management
- * that backed the radix tree. The maximum bytes that a TidStore can
- * use is specified by the max_bytes in TidStoreCreate(). We want the total
- * amount of memory consumption by a TidStore not to exceed the max_bytes.
- *
- * In local TidStore cases, the radix tree uses slab allocators for each kind
- * of node class. The most memory consuming case while adding Tids associated
- * with one page (i.e. during TidStoreSetBlockOffsets()) is that we allocate a new
- * slab block for a new radix tree node, which is approximately 70kB. Therefore,
- * we deduct 70kB from the max_bytes.
- *
- * In shared cases, DSA allocates the memory segments big enough to follow
- * a geometric series that approximately doubles the total DSA size (see
- * make_new_segment() in dsa.c). We simulated the how DSA increases segment
- * size and the simulation revealed, the 75% threshold for the maximum bytes
- * perfectly works in case where the max_bytes is a power-of-2, and the 60%
- * threshold works for other cases.
- */
if (area != NULL)
{
dsa_pointer dp;
- float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
LWTRANCHE_SHARED_TIDSTORE);
dp = dsa_allocate0(area, sizeof(TidStoreControl));
ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes = (size_t) (max_bytes * ratio);
ts->area = area;
ts->control->magic = TIDSTORE_MAGIC;
@@ -223,11 +201,15 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
else
{
ts->tree.local = local_rt_create(CurrentMemoryContext);
-
ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
- ts->control->max_bytes = max_bytes - (70 * 1024);
}
+ /*
+ * max_bytes is forced to be at least 64KB, the current minimum valid value
+ * for the work_mem GUC.
+ */
+ ts->control->max_bytes = Max(64 * 1024L, max_bytes);
+
ts->control->max_off = max_off;
ts->control->max_off_nbits = pg_ceil_log2_32(max_off);
@@ -331,14 +313,8 @@ TidStoreReset(TidStore *ts)
LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
- /*
- * Free the radix tree and return allocated DSA segments to
- * the operating system.
- */
- shared_rt_free(ts->tree.shared);
- dsa_trim(ts->area);
-
/* Recreate the radix tree */
+ shared_rt_free(ts->tree.shared);
ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
LWTRANCHE_SHARED_TIDSTORE);
@@ -352,6 +328,7 @@ TidStoreReset(TidStore *ts)
}
else
{
+ /* Recreate the radix tree */
local_rt_free(ts->tree.local);
ts->tree.local = local_rt_create(CurrentMemoryContext);
--
2.31.1
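
[Editor's note] For readers skimming the diffs, here is a hedged sketch (not
taken from the patches) of how a caller is expected to enforce the memory limit
after this change: the TidStore measures its own usage through the radix tree's
RT_MEASURE_MEMORY_USAGE support, and the caller simply checks fullness and
resets. The function names follow the TidStore API visible in v31-0009
(TidStoreSetBlockOffsets, TidStoreIsFull, TidStoreReset); the header path and
the helper function name are assumptions made for illustration:

#include "postgres.h"
#include "access/tidstore.h"	/* header path assumed */

/*
 * Hypothetical helper: record one heap page's dead offsets.  Once the store
 * reports that its max_bytes budget has been reached, the caller is expected
 * to process the collected TIDs (index vacuuming in the vacuum case) and
 * reset the store before continuing the heap scan.
 */
static void
record_dead_offsets(TidStore *dead_items, BlockNumber blkno,
					OffsetNumber *deadoffsets, int num_offsets)
{
	if (num_offsets > 0)
		TidStoreSetBlockOffsets(dead_items, blkno, deadoffsets, num_offsets);

	if (TidStoreIsFull(dead_items))
	{
		/* ... run an index vacuum cycle here (see lazy_vacuum()) ... */
		TidStoreReset(dead_items);
	}
}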
v31-0009-Review-vacuum-integration.patch
From c1c126e0f4e9f5eeb642bd892bd40948a41b8aae Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 17 Feb 2023 00:04:37 +0900
Subject: [PATCH v31 09/14] Review vacuum integration.
---
doc/src/sgml/monitoring.sgml | 2 +-
src/backend/access/heap/vacuumlazy.c | 61 +++++++++++++--------------
src/backend/commands/vacuum.c | 4 +-
src/backend/commands/vacuumparallel.c | 25 +++++------
4 files changed, 46 insertions(+), 46 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 47b346d36c..61e163636a 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7181,7 +7181,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
+ <structfield>dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
Amount of dead tuple data collected since the last index vacuum cycle.
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index b4e40423a8..edb9079124 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -10,11 +10,10 @@
* of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
- * autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * create a TidStore with the maximum bytes that can be used by the TidStore.
- * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
- * vacuum the pages that we've pruned). This frees up the memory space dedicated
- * to storing dead TIDs.
+ * autovacuum_work_mem) memory space to keep track of dead TIDs. If the
+ * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum
+ * the pages that we've pruned). This frees up the memory space dedicated to
+ * storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -844,7 +843,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
+ initprog_val[2] = TidStoreMaxMemory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -911,7 +910,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- if (tidstore_is_full(vacrel->dead_items))
+ if (TidStoreIsFull(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1080,16 +1079,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(tidstore_num_tids(dead_items) == 0);
+ Assert(TidStoreNumTids(dead_items) == 0);
}
else if (prunestate.num_offsets > 0)
{
/* Save details of the LP_DEAD items from the page in dead_items */
- tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
- prunestate.num_offsets);
+ TidStoreSetBlockOffsets(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
- tidstore_memory_usage(dead_items));
+ TidStoreMemoryUsage(dead_items));
}
/*
@@ -1260,7 +1259,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (tidstore_num_tids(dead_items) > 0)
+ if (TidStoreNumTids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -2127,10 +2126,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
+ TidStoreSetBlockOffsets(dead_items, blkno, deadoffsets, lpdead_items);
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
- tidstore_memory_usage(dead_items));
+ TidStoreMemoryUsage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2179,7 +2178,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- tidstore_reset(vacrel->dead_items);
+ TidStoreReset(vacrel->dead_items);
return;
}
@@ -2208,7 +2207,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
+ Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2236,7 +2235,7 @@ lazy_vacuum(LVRelState *vacrel)
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
bypass = (vacrel->lpdead_item_pages < threshold) &&
- tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
+ TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2281,7 +2280,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- tidstore_reset(vacrel->dead_items);
+ TidStoreReset(vacrel->dead_items);
}
/*
@@ -2354,7 +2353,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
+ TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2394,7 +2393,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
TidStoreIter *iter;
- TidStoreIterResult *result;
+ TidStoreIterResult *iter_result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2409,8 +2408,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- iter = tidstore_begin_iterate(vacrel->dead_items);
- while ((result = tidstore_iterate_next(iter)) != NULL)
+ iter = TidStoreBeginIterate(vacrel->dead_items);
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2419,7 +2418,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = result->blkno;
+ blkno = iter_result->blkno;
vacrel->blkno = blkno;
/*
@@ -2433,8 +2432,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
- buf, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, iter_result->offsets,
+ iter_result->num_offsets, buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2444,7 +2443,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
- tidstore_end_iterate(iter);
+ TidStoreEndIterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2455,12 +2454,12 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
- (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
+ (TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
- vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, TidStoreNumTids(vacrel->dead_items),
vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
@@ -3118,8 +3117,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
- NULL);
+ vacrel->dead_items = TidStoreCreate(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 785b825bbc..afedb87941 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2335,7 +2335,7 @@ vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
ereport(ivinfo->message_level,
(errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- tidstore_num_tids(dead_items))));
+ TidStoreNumTids(dead_items))));
return istat;
}
@@ -2376,5 +2376,5 @@ vac_tid_reaped(ItemPointer itemptr, void *state)
{
TidStore *dead_items = (TidStore *) state;
- return tidstore_lookup_tid(dead_items, itemptr);
+ return TidStoreIsMember(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index d653683693..9225daf3ab 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,11 +9,12 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParallelVacuumState contains shared information as well as
- * the shared TidStore. We launch parallel worker processes at the start of
- * parallel index bulk-deletion and index cleanup and once all indexes are
- * processed, the parallel worker processes exit. Each time we process indexes
- * in parallel, the parallel context is re-initialized so that the same DSM can
- * be used for multiple passes of index bulk-deletion and index cleanup.
+ * the memory space for storing dead items allocated in the DSA area. We
+ * launch parallel worker processes at the start of parallel index
+ * bulk-deletion and index cleanup and once all indexes are processed, the
+ * parallel worker processes exit. Each time we process indexes in parallel,
+ * the parallel context is re-initialized so that the same DSM can be used for
+ * multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -104,7 +105,7 @@ typedef struct PVShared
pg_atomic_uint32 idx;
/* Handle of the shared TidStore */
- tidstore_handle dead_items_handle;
+ TidStoreHandle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -289,7 +290,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ /* Initial size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -362,7 +363,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
LWTRANCHE_PARALLEL_VACUUM_DSA,
pcxt->seg);
- dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ dead_items = TidStoreCreate(vac_work_mem, max_offset, dead_items_dsa);
pvs->dead_items = dead_items;
pvs->dead_items_area = dead_items_dsa;
@@ -375,7 +376,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
- shared->dead_items_handle = tidstore_get_handle(dead_items);
+ shared->dead_items_handle = TidStoreGetHandle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -441,7 +442,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
- tidstore_destroy(pvs->dead_items);
+ TidStoreDestroy(pvs->dead_items);
dsa_detach(pvs->dead_items_area);
DestroyParallelContext(pvs->pcxt);
@@ -999,7 +1000,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Set dead items */
area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
dead_items_area = dsa_attach_in_place(area_space, seg);
- dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
+ dead_items = TidStoreAttach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1045,7 +1046,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
- tidstore_detach(pvs.dead_items);
+ TidStoreDetach(dead_items);
dsa_detach(dead_items_area);
/* Pop the error context stack */
--
2.31.1
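
For reviewers who want to see the renamed API in one place: below is a minimal
sketch (not part of the patch set) of the TidStore lifecycle that the call
sites above follow. It assumes the signatures match the usage visible in the
hunks; the helper name, the memory budget, and the block/offset values are
made up for illustration only.

#include "postgres.h"

#include "access/htup_details.h"
#include "access/tidstore.h"
#include "storage/itemptr.h"

/* Hypothetical helper, not part of the patches: exercises the renamed API. */
static void
tidstore_usage_sketch(void)
{
	TidStore   *dead_items;
	TidStoreIter *iter;
	TidStoreIterResult *iter_result;
	OffsetNumber offsets[2] = {1, 2};
	ItemPointerData tid;

	/* Serial case: no DSA area (cf. dead_items_alloc()); budget is arbitrary here */
	dead_items = TidStoreCreate(1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);

	/* Heap scan: remember the LP_DEAD offsets collected from block 10 */
	TidStoreSetBlockOffsets(dead_items, (BlockNumber) 10, offsets, 2);

	/* Index vacuuming: membership check per index tuple (cf. vac_tid_reaped()) */
	ItemPointerSet(&tid, 10, 1);
	if (TidStoreIsMember(dead_items, &tid))
		elog(DEBUG1, "(10,1) is recorded as dead");

	/* Second heap pass: iterate in block order (cf. lazy_vacuum_heap_rel()) */
	iter = TidStoreBeginIterate(dead_items);
	while ((iter_result = TidStoreIterateNext(iter)) != NULL)
		elog(DEBUG1, "block %u has %d dead offsets",
			 iter_result->blkno, iter_result->num_offsets);
	TidStoreEndIterate(iter);

	/* Forget everything before the next heap-scan round, then clean up */
	TidStoreReset(dead_items);
	TidStoreDestroy(dead_items);
}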
Attachment: v31-0007-Review-radix-tree.patch (application/octet-stream)
From 2c280fb3697501c70e4ce43808e3a5175bbc5eb2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 20 Feb 2023 11:28:50 +0900
Subject: [PATCH v31 07/14] Review radix tree.
Mainly improves the iteration code and comments.
---
src/include/lib/radixtree.h | 169 +++++++++---------
src/include/lib/radixtree_iter_impl.h | 85 ++++-----
.../expected/test_radixtree.out | 6 +-
.../modules/test_radixtree/test_radixtree.c | 103 +++++++----
4 files changed, 197 insertions(+), 166 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index e546bd705c..8bea606c62 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -83,7 +83,7 @@
* RT_SET - Set a key-value pair
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
* RT_ITERATE_NEXT - Return next key-value pair, if any
- * RT_END_ITER - End iteration
+ * RT_END_ITERATE - End iteration
* RT_MEMORY_USAGE - Get the memory usage
*
* Interface for Shared Memory
@@ -152,8 +152,8 @@
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
-#define RT_EXTEND RT_MAKE_NAME(extend)
-#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_EXTEND_UP RT_MAKE_NAME(extend_up)
+#define RT_EXTEND_DOWN RT_MAKE_NAME(extend_down)
#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
@@ -191,7 +191,7 @@
#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
-#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_SET_NODE_FROM RT_MAKE_NAME(iter_set_node_from)
#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
@@ -612,7 +612,6 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
#endif
/* Contains the actual tree and ancillary info */
-// WIP: this name is a bit strange
typedef struct RT_RADIX_TREE_CONTROL
{
#ifdef RT_SHMEM
@@ -651,36 +650,40 @@ typedef struct RT_RADIX_TREE
* Iteration support.
*
* Iterating the radix tree returns each pair of key and value in the ascending
- * order of the key. To support this, the we iterate nodes of each level.
+ * order of the key.
*
- * RT_NODE_ITER struct is used to track the iteration within a node.
+ * RT_NODE_ITER is the struct for iteration of one radix tree node.
*
* RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
- * in order to track the iteration of each level. During iteration, we also
- * construct the key whenever updating the node iteration information, e.g., when
- * advancing the current index within the node or when moving to the next node
- * at the same level.
- *
- * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
- * has the local pointers to nodes, rather than RT_PTR_ALLOC.
- * We need either a safeguard to disallow other processes to begin the iteration
- * while one process is doing or to allow multiple processes to do the iteration.
+ * for each level to track the iteration within the node.
*/
typedef struct RT_NODE_ITER
{
- RT_PTR_LOCAL node; /* current node being iterated */
- int current_idx; /* current position. -1 for initial value */
+ /*
+ * Local pointer to the node we are iterating over.
+ *
+ * Since the radix tree doesn't support the shared iteration among multiple
+ * processes, we use RT_PTR_LOCAL rather than RT_PTR_ALLOC.
+ */
+ RT_PTR_LOCAL node;
+
+ /*
+ * The next index of the chunk array in RT_NODE_KIND_3 and
+ * RT_NODE_KIND_32 nodes, or the next chunk in RT_NODE_KIND_125 and
+ * RT_NODE_KIND_256 nodes. 0 for the initial value.
+ */
+ int idx;
} RT_NODE_ITER;
typedef struct RT_ITER
{
RT_RADIX_TREE *tree;
- /* Track the iteration on nodes of each level */
- RT_NODE_ITER stack[RT_MAX_LEVEL];
- int stack_len;
+ /* Track the nodes for each level. level = 0 is for a leaf node */
+ RT_NODE_ITER node_iters[RT_MAX_LEVEL];
+ int top_level;
- /* The key is constructed during iteration */
+ /* The key constructed during the iteration */
uint64 key;
} RT_ITER;
@@ -1243,7 +1246,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
* it can store the key.
*/
static pg_noinline void
-RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+RT_EXTEND_UP(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
@@ -1282,7 +1285,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static pg_noinline void
-RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+RT_EXTEND_DOWN(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
int shift = node->shift;
@@ -1613,7 +1616,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
/* Extend the tree if necessary */
if (key > tree->ctl->max_val)
- RT_EXTEND(tree, key);
+ RT_EXTEND_UP(tree, key);
stored_child = tree->ctl->root;
parent = RT_PTR_GET_LOCAL(tree, stored_child);
@@ -1631,7 +1634,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
{
- RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_EXTEND_DOWN(tree, key, value_p, parent, stored_child, child);
RT_UNLOCK(tree);
return false;
}
@@ -1805,16 +1808,9 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
}
#endif
-static inline void
-RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
-{
- iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
- iter->key |= (((uint64) chunk) << shift);
-}
-
/*
- * Advance the slot in the inner node. Return the child if exists, otherwise
- * null.
+ * Scan the inner node and return the next child node if one exists, otherwise
+ * return NULL.
*/
static inline RT_PTR_LOCAL
RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
@@ -1825,8 +1821,8 @@ RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
}
/*
- * Advance the slot in the leaf node. On success, return true and the value
- * is set to value_p, otherwise return false.
+ * Scan the leaf node; if the next value exists, set it to value_p and
+ * return true. Otherwise return false.
*/
static inline bool
RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
@@ -1838,29 +1834,50 @@ RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
}
/*
- * Update each node_iter for inner nodes in the iterator node stack.
+ * While descending the radix tree from the 'from' node to the bottom, we
+ * set the next node to iterate for each level.
*/
static void
-RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+RT_ITER_SET_NODE_FROM(RT_ITER *iter, RT_PTR_LOCAL from)
{
- int level = from;
- RT_PTR_LOCAL node = from_node;
+ int level = from->shift / RT_NODE_SPAN;
+ RT_PTR_LOCAL node = from;
for (;;)
{
- RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+ RT_NODE_ITER *node_iter = &(iter->node_iters[level--]);
+
+#ifdef USE_ASSERT_CHECKING
+ if (node_iter->node)
+ {
+ /* We must have finished the iteration on the previous node */
+ if (RT_NODE_IS_LEAF(node_iter->node))
+ {
+ uint64 dummy;
+ Assert(!RT_NODE_LEAF_ITERATE_NEXT(iter, node_iter, &dummy));
+ }
+ else
+ Assert(!RT_NODE_INNER_ITERATE_NEXT(iter, node_iter));
+ }
+#endif
+ /* Set the node to the node iterator of this level */
node_iter->node = node;
- node_iter->current_idx = -1;
+ node_iter->idx = 0;
- /* We don't advance the leaf node iterator here */
if (RT_NODE_IS_LEAF(node))
- return;
+ {
+ /* We will visit the leaf node when RT_ITERATE_NEXT() is called */
+ break;
+ }
- /* Advance to the next slot in the inner node */
+ /*
+ * Get the first child node from the node, which corresponds to the
+ * lowest chunk within the node.
+ */
node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
- /* We must find the first children in the node */
+ /* The first child must be found */
Assert(node);
}
}
@@ -1874,14 +1891,11 @@ RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
RT_SCOPE RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
{
- MemoryContext old_ctx;
RT_ITER *iter;
RT_PTR_LOCAL root;
- int top_level;
- old_ctx = MemoryContextSwitchTo(tree->context);
-
- iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->context,
+ sizeof(RT_ITER));
iter->tree = tree;
RT_LOCK_SHARED(tree);
@@ -1891,16 +1905,13 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
return iter;
root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
- top_level = root->shift / RT_NODE_SPAN;
- iter->stack_len = top_level;
+ iter->top_level = root->shift / RT_NODE_SPAN;
/*
- * Descend to the left most leaf node from the root. The key is being
- * constructed while descending to the leaf.
+ * Set the next node to iterate for each level from the level of the
+ * root node.
*/
- RT_UPDATE_ITER_STACK(iter, root, top_level);
-
- MemoryContextSwitchTo(old_ctx);
+ RT_ITER_SET_NODE_FROM(iter, root);
return iter;
}
@@ -1912,6 +1923,8 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
RT_SCOPE bool
RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
+ Assert(value_p != NULL);
+
/* Empty tree */
if (!iter->tree->ctl->root)
return false;
@@ -1919,43 +1932,38 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
for (;;)
{
RT_PTR_LOCAL child = NULL;
- RT_VALUE_TYPE value;
- int level;
- bool found;
-
- /* Advance the leaf node iterator to get next key-value pair */
- found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
- if (found)
+ /* Get the next chunk of the leaf node */
+ if (RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->node_iters[0]), value_p))
{
*key_p = iter->key;
- *value_p = value;
return true;
}
/*
- * We've visited all values in the leaf node, so advance inner node
- * iterators from the level=1 until we find the next child node.
+ * We've visited all values in the leaf node, so advance all inner node
+ * iterators by visiting inner nodes from the level = 1 until we find the
+ * next inner node that has a child node.
*/
- for (level = 1; level <= iter->stack_len; level++)
+ for (int level = 1; level <= iter->top_level; level++)
{
- child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->node_iters[level]));
if (child)
break;
}
- /* the iteration finished */
+ /* We've visited all nodes, so the iteration finished */
if (!child)
- return false;
+ break;
/*
- * Set the node to the node iterator and update the iterator stack
- * from this node.
+ * Found the new child node. We update the next node to iterate for each
+ * level from the level of this child node.
*/
- RT_UPDATE_ITER_STACK(iter, child, level - 1);
+ RT_ITER_SET_NODE_FROM(iter, child);
- /* Node iterators are updated, so try again from the leaf */
+ /* Find key-value from the leaf node again */
}
return false;
@@ -2470,8 +2478,8 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_INIT_NODE
#undef RT_FREE_NODE
#undef RT_FREE_RECURSE
-#undef RT_EXTEND
-#undef RT_SET_EXTEND
+#undef RT_EXTEND_UP
+#undef RT_EXTEND_DOWN
#undef RT_SWITCH_NODE_KIND
#undef RT_COPY_NODE
#undef RT_REPLACE_NODE
@@ -2509,8 +2517,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_NODE_INSERT_LEAF
#undef RT_NODE_INNER_ITERATE_NEXT
#undef RT_NODE_LEAF_ITERATE_NEXT
-#undef RT_UPDATE_ITER_STACK
-#undef RT_ITER_UPDATE_KEY
+#undef RT_ITER_SET_NODE_FROM
#undef RT_VERIFY_NODE
#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 98c78eb237..5c1034768e 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -27,12 +27,10 @@
#error node level must be either inner or leaf
#endif
- bool found = false;
- uint8 key_chunk;
+ uint8 key_chunk = 0;
#ifdef RT_NODE_LEVEL_LEAF
- RT_VALUE_TYPE value;
-
+ Assert(value_p != NULL);
Assert(RT_NODE_IS_LEAF(node_iter->node));
#else
RT_PTR_LOCAL child = NULL;
@@ -50,99 +48,92 @@
{
RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
- node_iter->current_idx++;
- if (node_iter->current_idx >= n3->base.n.count)
- break;
+ if (node_iter->idx >= n3->base.n.count)
+ return false;
+
#ifdef RT_NODE_LEVEL_LEAF
- value = n3->values[node_iter->current_idx];
+ *value_p = n3->values[node_iter->idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->idx]);
#endif
- key_chunk = n3->base.chunks[node_iter->current_idx];
- found = true;
+ key_chunk = n3->base.chunks[node_iter->idx];
+ node_iter->idx++;
break;
}
case RT_NODE_KIND_32:
{
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
- node_iter->current_idx++;
- if (node_iter->current_idx >= n32->base.n.count)
- break;
+ if (node_iter->idx >= n32->base.n.count)
+ return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = n32->values[node_iter->current_idx];
+ *value_p = n32->values[node_iter->idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->idx]);
#endif
- key_chunk = n32->base.chunks[node_iter->current_idx];
- found = true;
+ key_chunk = n32->base.chunks[node_iter->idx];
+ node_iter->idx++;
break;
}
case RT_NODE_KIND_125:
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
- int i;
+ int chunk;
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
{
- if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
break;
}
- if (i >= RT_NODE_MAX_SLOTS)
- break;
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
- node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
#else
- child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, chunk));
#endif
- key_chunk = i;
- found = true;
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
break;
}
case RT_NODE_KIND_256:
{
RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
- int i;
+ int chunk;
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
{
#ifdef RT_NODE_LEVEL_LEAF
- if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
#else
- if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
#endif
break;
}
- if (i >= RT_NODE_MAX_SLOTS)
- break;
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
- node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
#else
- child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, chunk));
#endif
- key_chunk = i;
- found = true;
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
break;
}
}
- if (found)
- {
- RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
-#ifdef RT_NODE_LEVEL_LEAF
- *value_p = value;
-#endif
- }
+ /* Update the part of the key */
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << node_iter->node->shift);
+ iter->key |= (((uint64) key_chunk) << node_iter->node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- return found;
+ return true;
#else
return child;
#endif
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index ce645cb8b5..7ad1ce3605 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -4,8 +4,10 @@ CREATE EXTENSION test_radixtree;
-- an error if something fails.
--
SELECT test_radixtree();
-NOTICE: testing basic operations with leaf node 4
-NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 3
+NOTICE: testing basic operations with inner node 3
+NOTICE: testing basic operations with leaf node 15
+NOTICE: testing basic operations with inner node 15
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 125
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index afe53382f3..5a169854d9 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -43,12 +43,15 @@ typedef uint64 TestValueType;
*/
static const bool rt_test_stats = false;
-static int rt_node_kind_fanouts[] = {
- 0,
- 4, /* RT_NODE_KIND_4 */
- 32, /* RT_NODE_KIND_32 */
- 125, /* RT_NODE_KIND_125 */
- 256 /* RT_NODE_KIND_256 */
+/*
+ * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
+ */
+static int rt_node_class_fanouts[] = {
+ 3, /* RT_CLASS_3 */
+ 15, /* RT_CLASS_32_MIN */
+ 32, /* RT_CLASS_32_MAX */
+ 125, /* RT_CLASS_125 */
+ 256 /* RT_CLASS_256 */
};
/*
* A struct to define a pattern of integers, for use with the test_pattern()
@@ -260,10 +263,9 @@ test_basic(int children, bool test_inner)
* Check if keys from start to end with the shift exist in the tree.
*/
static void
-check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
- int incr)
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end)
{
- for (int i = start; i < end; i++)
+ for (int i = start; i <= end; i++)
{
uint64 key = ((uint64) i << shift);
TestValueType val;
@@ -277,22 +279,26 @@ check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
}
}
+/*
+ * Insert 256 key-value pairs, and check if keys are properly inserted on each
+ * node class.
+ */
+/* Test keys [0, 256) */
+#define NODE_TYPE_TEST_KEY_MIN 0
+#define NODE_TYPE_TEST_KEY_MAX 256
static void
-test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+test_node_types_insert_asc(rt_radix_tree *radixtree, uint8 shift)
{
- uint64 num_entries;
- int ninserted = 0;
- int start = insert_asc ? 0 : 256;
- int incr = insert_asc ? 1 : -1;
- int end = insert_asc ? 256 : 0;
- int node_kind_idx = 1;
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = 0;
- for (int i = start; i != end; i += incr)
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
{
uint64 key = ((uint64) i << shift);
bool found;
- found = rt_set(radixtree, key, (TestValueType*) &key);
+ found = rt_set(radixtree, key, (TestValueType *) &key);
if (found)
elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
@@ -300,24 +306,49 @@ test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
* After filling all slots in each node type, check if the values
* are stored properly.
*/
- if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
{
- int check_start = insert_asc
- ? rt_node_kind_fanouts[node_kind_idx - 1]
- : rt_node_kind_fanouts[node_kind_idx];
- int check_end = insert_asc
- ? rt_node_kind_fanouts[node_kind_idx]
- : rt_node_kind_fanouts[node_kind_idx - 1];
-
- check_search_on_node(radixtree, shift, check_start, check_end, incr);
- node_kind_idx++;
+ check_search_on_node(radixtree, shift, key_checked, i);
+ key_checked = i;
+ node_class_idx++;
}
-
- ninserted++;
}
num_entries = rt_num_entries(radixtree);
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Similar to test_node_types_insert_asc(), but inserts keys in descending order.
+ */
+static void
+test_node_types_insert_desc(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = NODE_TYPE_TEST_KEY_MAX - 1;
+
+ for (int i = NODE_TYPE_TEST_KEY_MAX - 1; i >= NODE_TYPE_TEST_KEY_MIN; i--)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType *) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
+ {
+ check_search_on_node(radixtree, shift, i, key_checked);
+ key_checked = i;
+ node_class_idx++;
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
if (num_entries != 256)
elog(ERROR,
"rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
@@ -329,7 +360,7 @@ test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
{
uint64 num_entries;
- for (int i = 0; i < 256; i++)
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
{
uint64 key = ((uint64) i << shift);
bool found;
@@ -379,9 +410,9 @@ test_node_types(uint8 shift)
* then delete all entries to make it empty, and insert and search entries
* again.
*/
- test_node_types_insert(radixtree, shift, true);
+ test_node_types_insert_asc(radixtree, shift);
test_node_types_delete(radixtree, shift);
- test_node_types_insert(radixtree, shift, false);
+ test_node_types_insert_desc(radixtree, shift);
rt_free(radixtree);
#ifdef RT_SHMEM
@@ -664,10 +695,10 @@ test_radixtree(PG_FUNCTION_ARGS)
{
test_empty();
- for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ for (int i = 0; i < lengthof(rt_node_class_fanouts); i++)
{
- test_basic(rt_node_kind_fanouts[i], false);
- test_basic(rt_node_kind_fanouts[i], true);
+ test_basic(rt_node_class_fanouts[i], false);
+ test_basic(rt_node_class_fanouts[i], true);
}
for (int shift = 0; shift <= (64 - 8); shift += 8)
--
2.31.1
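
As context for the iteration changes above, here is a minimal sketch (not for
commit, and not part of the patches) of how a caller instantiates the template
and walks all key-value pairs with the RT_BEGIN_ITERATE / RT_ITERATE_NEXT /
RT_END_ITERATE interface. The template parameters mirror the ones used by the
benchmark module attached below; the helper name and key values are arbitrary.

#include "postgres.h"

/*
 * Instantiate a local (non-shared) radix tree with uint64 values, using the
 * same parameters as the benchmark module below.
 */
#define RT_PREFIX rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
#define RT_VALUE_TYPE uint64
#include "lib/radixtree.h"

/* Hypothetical helper, not part of the patches: iterate in ascending key order. */
static void
radixtree_iteration_sketch(void)
{
	rt_radix_tree *tree = rt_create(CurrentMemoryContext);
	rt_iter    *iter;
	uint64		key;
	uint64		value;

	for (uint64 i = 0; i < 1000; i++)
	{
		uint64		v = i;

		rt_set(tree, i * 257, &v);	/* spread keys over multiple chunks */
	}

	/* Pairs come back in ascending key order */
	iter = rt_begin_iterate(tree);
	while (rt_iterate_next(iter, &key, &value))
		elog(DEBUG1, "key " UINT64_FORMAT " => " UINT64_FORMAT, key, value);
	rt_end_iterate(iter);

	rt_free(tree);
}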
Attachment: v31-0014-Revert-building-benchmark-module-for-CI.patch (application/octet-stream)
From 7bae7b13e777c826c542ac33766ad8358672d9cc Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 19:31:34 +0700
Subject: [PATCH v31 14/14] Revert building benchmark module for CI
---
contrib/meson.build | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/contrib/meson.build b/contrib/meson.build
index 421d469f8c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,7 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
-subdir('bench_radix_tree')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
Attachment: v31-0005-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch (application/octet-stream)
From bc5b4650377c4dcb4f108013a5638d6f17cd13ef Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v31 05/14] Tool for measuring radix tree and tidstore
performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 88 +++
contrib/bench_radix_tree/bench_radix_tree.c | 747 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 925 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..ad66265e23
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,88 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_tidstore_load(
+minblk int4,
+maxblk int4,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT iter_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..6e5149e2c4
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,747 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+//#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+PG_FUNCTION_INFO_V1(bench_tidstore_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation */
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+Datum
+bench_tidstore_load(PG_FUNCTION_ARGS)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
+ OffsetNumber *offs;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_ms;
+ int64 iter_ms;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3] = {false};
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ offs = palloc(sizeof(OffsetNumber) * TIDS_PER_BLOCK_FOR_LOAD);
+ for (int i = 0; i < TIDS_PER_BLOCK_FOR_LOAD; i++)
+ offs[i] = i + 1; /* FirstOffsetNumber is 1 */
+
+ ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
+
+ /* load tids */
+ start_time = GetCurrentTimestamp();
+ for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
+ tidstore_add_tids(ts, blkno, offs, TIDS_PER_BLOCK_FOR_LOAD);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* iterate through tids */
+ iter = tidstore_begin_iterate(ts);
+ start_time = GetCurrentTimestamp();
+ while ((result = tidstore_iterate_next(iter)) != NULL)
+ ;
+ tidstore_end_iterate(iter);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ iter_ms = secs * 1000 + usecs / 1000;
+
+ values[0] = Int64GetDatum(tidstore_memory_usage(ts));
+ values[1] = Int64GetDatum(load_ms);
+ values[2] = Int64GetDatum(iter_ms);
+
+ tidstore_destroy(ts);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+ /* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ int64 search_time_ms;
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* to silence warnings about unused iter functions */
+static void pg_attribute_unused()
+stub_iter()
+{
+ rt_radix_tree *rt;
+ rt_iter *iter;
+ uint64 key = 1;
+ uint64 value = 1;
+
+ rt = rt_create(CurrentMemoryContext);
+
+ iter = rt_begin_iterate(rt);
+ rt_iterate_next(iter, &key, &value);
+ rt_end_iterate(iter);
+}
\ No newline at end of file
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..421d469f8c 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
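
To make tid_to_key_off() in the benchmark above (and the key/value layout
that TidStore relies on, described in the next patch) easier to follow, here
is a worked example of the encoding. It assumes 8kB heap pages, where
MaxHeapTuplesPerPage is 291 and pg_ceil_log2_32() therefore yields a 9-bit
shift; the block and offset numbers are arbitrary.

/*
 * Worked example of the encoding in tid_to_key_off(); not part of the
 * patches, it only restates the arithmetic above. Assumes 8kB heap pages,
 * so the offset number occupies 9 bits.
 */
BlockNumber blkno = 1000;
OffsetNumber offnum = 5;

uint64	tid_i = (uint64) offnum | ((uint64) blkno << 9);	/* 512005 */
uint32	bit = tid_i & ((1 << 6) - 1);		/* 5: bit position in the 64-bit value */
uint64	key = tid_i >> 6;					/* 8000: key stored in the radix tree */
uint64	value = UINT64CONST(1) << bit;		/* bitmap with only TID (1000,5) set */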
Attachment: v31-0008-Review-TidStore.patch (application/octet-stream)
From 6842622ec10cf702fd062caccb091ce5ecbe56b5 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 16 Feb 2023 23:45:39 +0900
Subject: [PATCH v31 08/14] Review TidStore.
---
src/backend/access/common/tidstore.c | 340 +++++++++---------
src/include/access/tidstore.h | 37 +-
.../modules/test_tidstore/test_tidstore.c | 68 ++--
3 files changed, 234 insertions(+), 211 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 8c05e60d92..9360520482 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -3,18 +3,19 @@
* tidstore.c
* Tid (ItemPointerData) storage implementation.
*
- * This module provides a in-memory data structure to store Tids (ItemPointer).
- * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
- * stored in the radix tree.
+ * TidStore is an in-memory data structure to store tids (ItemPointerData).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
+ * and stored in the radix tree.
*
- * A TidStore can be shared among parallel worker processes by passing DSA area
- * to tidstore_create(). Other backends can attach to the shared TidStore by
- * tidstore_attach().
+ * TidStore can be shared among parallel worker processes by passing DSA area
+ * to TidStoreCreate(). Other backends can attach to the shared TidStore by
+ * TidStoreAttach().
*
- * Regarding the concurrency, it basically relies on the concurrency support in
- * the radix tree, but we acquires the lock on a TidStore in some cases, for
- * example, when to reset the store and when to access the number tids in the
- * store (num_tids).
+ * Regarding the concurrency support, we use a single LWLock for the TidStore.
+ * The TidStore is exclusively locked when inserting encoded tids to the
+ * radix tree or when resetting itself. When searching on the TidStore or
+ * doing the iteration, it is not locked but the underlying radix tree is
+ * locked in shared mode.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -34,16 +35,18 @@
#include "utils/memutils.h"
/*
- * For encoding purposes, tids are represented as a pair of 64-bit key and
- * 64-bit value. First, we construct 64-bit unsigned integer by combining
- * the block number and the offset number. The number of bits used for the
- * offset number is specified by max_offsets in tidstore_create(). We are
- * frugal with the bits, because smaller keys could help keeping the radix
- * tree shallow.
+ * For encoding purposes, a tid is represented as a pair of 64-bit key and
+ * 64-bit value.
*
- * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
- * the offset number and uses the next 32 bits for the block number. That
- * is, only 41 bits are used:
+ * First, we construct a 64-bit unsigned integer by combining the block
+ * number and the offset number. The number of bits used for the offset number
+ * is specified by max_off in TidStoreCreate(). We are frugal with the bits,
+ * because smaller keys could help keeping the radix tree shallow.
+ *
+ * For example, a heap tid on an 8kB block uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. 9 bits
+ * are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks. That is, only 41 bits are used:
*
* uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
*
@@ -52,30 +55,34 @@
* u = unused bit
* (high on the left, low on the right)
*
- * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
- * on 8kB blocks.
- *
- * The 64-bit value is the bitmap representation of the lowest 6 bits
- * (TIDSTORE_VALUE_NBITS) of the integer, and the rest 35 bits are used
- * as the key:
+ * Then, 64-bit value is the bitmap representation of the lowest 6 bits
+ * (LOWER_OFFSET_NBITS) of the integer, and 64-bit key consists of the
+ * upper 3 bits of the offset number and the block number, 35 bits in
+ * total:
*
* uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
* |----| value
- * |---------------------------------------------| key
+ * |--------------------------------------| key
*
* The maximum height of the radix tree is 5 in this case.
+ *
+ * If the number of bits required for offset numbers fits in LOWER_OFFSET_NBITS,
+ * 64-bit value is the bitmap representation of the offset number, and the
+ * 64-bit key is the block number.
*/
-#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
-#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
+typedef uint64 tidkey;
+typedef uint64 offsetbm;
+#define LOWER_OFFSET_NBITS 6 /* log(sizeof(offsetbm), 2) */
+#define LOWER_OFFSET_MASK ((1 << LOWER_OFFSET_NBITS) - 1)
-/* A magic value used to identify our TidStores. */
+/* A magic value used to identify our TidStore. */
#define TIDSTORE_MAGIC 0x826f6a10
#define RT_PREFIX local_rt
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
+#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
#define RT_PREFIX shared_rt
@@ -83,7 +90,7 @@
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
+#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
/* The control object for a TidStore */
@@ -94,10 +101,10 @@ typedef struct TidStoreControl
/* These values are never changed after creation */
size_t max_bytes; /* the maximum bytes a TidStore can use */
- int max_offset; /* the maximum offset number */
- int offset_nbits; /* the number of bits required for an offset
- * number */
- int offset_key_nbits; /* the number of bits of an offset number
+ int max_off; /* the maximum offset number */
+ int max_off_nbits; /* the number of bits required for offset
+ * numbers */
+ int upper_off_nbits; /* the number of bits of offset numbers
* used in a key */
/* The below fields are used only in shared case */
@@ -106,7 +113,7 @@ typedef struct TidStoreControl
LWLock lock;
/* handles for TidStore and radix tree */
- tidstore_handle handle;
+ TidStoreHandle handle;
shared_rt_handle tree_handle;
} TidStoreControl;
@@ -147,24 +154,27 @@ typedef struct TidStoreIter
bool finished;
/* save for the next iteration */
- uint64 next_key;
- uint64 next_val;
+ tidkey next_tidkey;
+ offsetbm next_off_bitmap;
- /* output for the caller */
- TidStoreIterResult result;
+ /*
+ * output for the caller. Must be last because variable-size.
+ */
+ TidStoreIterResult output;
} TidStoreIter;
-static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
-static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
-static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
-static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
+static void iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap);
+static inline BlockNumber key_get_blkno(TidStore *ts, tidkey key);
+static inline tidkey encode_blk_off(TidStore *ts, BlockNumber block,
+ OffsetNumber offset, offsetbm *off_bit);
+static inline tidkey encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit);
/*
* Create a TidStore. The returned object is allocated in backend-local memory.
* The radix tree for storage is allocated in DSA area is 'area' is non-NULL.
*/
TidStore *
-tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
{
TidStore *ts;
@@ -176,12 +186,12 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
* Memory consumption depends on the number of stored tids, but also on the
* distribution of them, how the radix tree stores, and the memory management
* that backed the radix tree. The maximum bytes that a TidStore can
- * use is specified by the max_bytes in tidstore_create(). We want the total
+ * use is specified by the max_bytes in TidStoreCreate(). We want the total
* amount of memory consumption by a TidStore not to exceed the max_bytes.
*
* In local TidStore cases, the radix tree uses slab allocators for each kind
* of node class. The most memory consuming case while adding Tids associated
- * with one page (i.e. during tidstore_add_tids()) is that we allocate a new
+ * with one page (i.e. during TidStoreSetBlockOffsets()) is that we allocate a new
* slab block for a new radix tree node, which is approximately 70kB. Therefore,
* we deduct 70kB from the max_bytes.
*
@@ -202,7 +212,7 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
dp = dsa_allocate0(area, sizeof(TidStoreControl));
ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->control->max_bytes = (size_t) (max_bytes * ratio);
ts->area = area;
ts->control->magic = TIDSTORE_MAGIC;
@@ -218,14 +228,14 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
ts->control->max_bytes = max_bytes - (70 * 1024);
}
- ts->control->max_offset = max_offset;
- ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+ ts->control->max_off = max_off;
+ ts->control->max_off_nbits = pg_ceil_log2_32(max_off);
- if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
- ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+ if (ts->control->max_off_nbits < LOWER_OFFSET_NBITS)
+ ts->control->max_off_nbits = LOWER_OFFSET_NBITS;
- ts->control->offset_key_nbits =
- ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+ ts->control->upper_off_nbits =
+ ts->control->max_off_nbits - LOWER_OFFSET_NBITS;
return ts;
}
@@ -235,7 +245,7 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
* allocated in backend-local memory using the CurrentMemoryContext.
*/
TidStore *
-tidstore_attach(dsa_area *area, tidstore_handle handle)
+TidStoreAttach(dsa_area *area, TidStoreHandle handle)
{
TidStore *ts;
dsa_pointer control;
@@ -266,7 +276,7 @@ tidstore_attach(dsa_area *area, tidstore_handle handle)
* to the operating system.
*/
void
-tidstore_detach(TidStore *ts)
+TidStoreDetach(TidStore *ts)
{
Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
@@ -279,12 +289,12 @@ tidstore_detach(TidStore *ts)
*
* TODO: The caller must be certain that no other backend will attempt to
* access the TidStore before calling this function. Other backend must
- * explicitly call tidstore_detach to free up backend-local memory associated
- * with the TidStore. The backend that calls tidstore_destroy must not call
- * tidstore_detach.
+ * explicitly call TidStoreDetach() to free up backend-local memory associated
+ * with the TidStore. The backend that calls TidStoreDestroy() must not call
+ * TidStoreDetach().
*/
void
-tidstore_destroy(TidStore *ts)
+TidStoreDestroy(TidStore *ts)
{
if (TidStoreIsShared(ts))
{
@@ -309,11 +319,11 @@ tidstore_destroy(TidStore *ts)
}
/*
- * Forget all collected Tids. It's similar to tidstore_destroy but we don't free
+ * Forget all collected Tids. It's similar to TidStoreDestroy() but we don't free
* entire TidStore but recreate only the radix tree storage.
*/
void
-tidstore_reset(TidStore *ts)
+TidStoreReset(TidStore *ts)
{
if (TidStoreIsShared(ts))
{
@@ -350,30 +360,34 @@ tidstore_reset(TidStore *ts)
}
}
-/* Add Tids on a block to TidStore */
+/*
+ * Set the given tids on the blkno to TidStore.
+ *
+ * NB: the offset numbers in offsets must be sorted in ascending order.
+ */
void
-tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
- int num_offsets)
+TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
{
- uint64 *values;
- uint64 key;
- uint64 prev_key;
- uint64 off_bitmap = 0;
+ offsetbm *bitmaps;
+ tidkey key;
+ tidkey prev_key;
+ offsetbm off_bitmap = 0;
int idx;
- const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
- const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+ const tidkey key_base = ((uint64) blkno) << ts->control->upper_off_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->upper_off_nbits;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- values = palloc(sizeof(uint64) * nkeys);
+ bitmaps = palloc(sizeof(offsetbm) * nkeys);
key = prev_key = key_base;
for (int i = 0; i < num_offsets; i++)
{
- uint64 off_bit;
+ offsetbm off_bit;
/* encode the tid to a key and partial offset */
- key = encode_key_off(ts, blkno, offsets[i], &off_bit);
+ key = encode_blk_off(ts, blkno, offsets[i], &off_bit);
/* make sure we scanned the line pointer array in order */
Assert(key >= prev_key);
@@ -384,11 +398,11 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
Assert(idx >= 0 && idx < nkeys);
/* write out offset bitmap for this key */
- values[idx] = off_bitmap;
+ bitmaps[idx] = off_bitmap;
/* zero out any gaps up to the current key */
for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
- values[empty_idx] = 0;
+ bitmaps[empty_idx] = 0;
/* reset for current key -- the current offset will be handled below */
off_bitmap = 0;
@@ -401,7 +415,7 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
/* save the final index for later */
idx = key - key_base;
/* write out last offset bitmap */
- values[idx] = off_bitmap;
+ bitmaps[idx] = off_bitmap;
if (TidStoreIsShared(ts))
LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
@@ -409,14 +423,14 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
/* insert the calculated key-values to the tree */
for (int i = 0; i <= idx; i++)
{
- if (values[i])
+ if (bitmaps[i])
{
key = key_base + i;
if (TidStoreIsShared(ts))
- shared_rt_set(ts->tree.shared, key, &values[i]);
+ shared_rt_set(ts->tree.shared, key, &bitmaps[i]);
else
- local_rt_set(ts->tree.local, key, &values[i]);
+ local_rt_set(ts->tree.local, key, &bitmaps[i]);
}
}
@@ -426,70 +440,70 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
if (TidStoreIsShared(ts))
LWLockRelease(&ts->control->lock);
- pfree(values);
+ pfree(bitmaps);
}
/* Return true if the given tid is present in the TidStore */
bool
-tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+TidStoreIsMember(TidStore *ts, ItemPointer tid)
{
- uint64 key;
- uint64 val = 0;
- uint64 off_bit;
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ offsetbm off_bit;
bool found;
- key = tid_to_key_off(ts, tid, &off_bit);
+ key = encode_tid(ts, tid, &off_bit);
if (TidStoreIsShared(ts))
- found = shared_rt_search(ts->tree.shared, key, &val);
+ found = shared_rt_search(ts->tree.shared, key, &off_bitmap);
else
- found = local_rt_search(ts->tree.local, key, &val);
+ found = local_rt_search(ts->tree.local, key, &off_bitmap);
if (!found)
return false;
- return (val & off_bit) != 0;
+ return (off_bitmap & off_bit) != 0;
}
/*
- * Prepare to iterate through a TidStore. Since the radix tree is locked during the
- * iteration, so tidstore_end_iterate() needs to called when finished.
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, TidStoreEndIterate() needs to be called when finished.
+ *
+ * The TidStoreIter struct is created in the caller's memory context.
*
* Concurrent updates during the iteration will be blocked when inserting a
* key-value to the radix tree.
*/
TidStoreIter *
-tidstore_begin_iterate(TidStore *ts)
+TidStoreBeginIterate(TidStore *ts)
{
TidStoreIter *iter;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- iter = palloc0(sizeof(TidStoreIter));
+ iter = palloc0(sizeof(TidStoreIter) +
+ sizeof(OffsetNumber) * ts->control->max_off);
iter->ts = ts;
- iter->result.blkno = InvalidBlockNumber;
- iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
-
if (TidStoreIsShared(ts))
iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
else
iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
/* If the TidStore is empty, there is no business */
- if (tidstore_num_tids(ts) == 0)
+ if (TidStoreNumTids(ts) == 0)
iter->finished = true;
return iter;
}
static inline bool
-tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+tidstore_iter(TidStoreIter *iter, tidkey *key, offsetbm *off_bitmap)
{
if (TidStoreIsShared(iter->ts))
- return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, off_bitmap);
- return local_rt_iterate_next(iter->tree_iter.local, key, val);
+ return local_rt_iterate_next(iter->tree_iter.local, key, off_bitmap);
}
/*
@@ -498,45 +512,48 @@ tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
* numbers in each result is also sorted in ascending order.
*/
TidStoreIterResult *
-tidstore_iterate_next(TidStoreIter *iter)
+TidStoreIterateNext(TidStoreIter *iter)
{
- uint64 key;
- uint64 val;
- TidStoreIterResult *result = &(iter->result);
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ TidStoreIterResult *output = &(iter->output);
if (iter->finished)
return NULL;
- if (BlockNumberIsValid(result->blkno))
- {
- /* Process the previously collected key-value */
- result->num_offsets = 0;
- tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
- }
+ /* Initialize the outputs */
+ output->blkno = InvalidBlockNumber;
+ output->num_offsets = 0;
- while (tidstore_iter_kv(iter, &key, &val))
- {
- BlockNumber blkno;
+ /*
+ * Decode the key and offset bitmap collected in the previous
+ * iteration, if any.
+ */
+ if (iter->next_off_bitmap > 0)
+ iter_decode_key_off(iter, iter->next_tidkey, iter->next_off_bitmap);
- blkno = key_get_blkno(iter->ts, key);
+ while (tidstore_iter(iter, &key, &off_bitmap))
+ {
+ BlockNumber blkno = key_get_blkno(iter->ts, key);
- if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ if (BlockNumberIsValid(output->blkno) && output->blkno != blkno)
{
/*
- * We got a key-value pair for a different block. So return the
- * collected tids, and remember the key-value for the next iteration.
+ * We got tids for a different block. We return the collected
+ * tids so far, and remember the key-value for the next
+ * iteration.
*/
- iter->next_key = key;
- iter->next_val = val;
- return result;
+ iter->next_tidkey = key;
+ iter->next_off_bitmap = off_bitmap;
+ return output;
}
- /* Collect tids extracted from the key-value pair */
- tidstore_iter_extract_tids(iter, key, val);
+ /* Collect tids decoded from the key and offset bitmap */
+ iter_decode_key_off(iter, key, off_bitmap);
}
iter->finished = true;
- return result;
+ return output;
}
/*
@@ -544,22 +561,21 @@ tidstore_iterate_next(TidStoreIter *iter)
* or when existing an iteration.
*/
void
-tidstore_end_iterate(TidStoreIter *iter)
+TidStoreEndIterate(TidStoreIter *iter)
{
if (TidStoreIsShared(iter->ts))
shared_rt_end_iterate(iter->tree_iter.shared);
else
local_rt_end_iterate(iter->tree_iter.local);
- pfree(iter->result.offsets);
pfree(iter);
}
/* Return the number of tids we collected so far */
int64
-tidstore_num_tids(TidStore *ts)
+TidStoreNumTids(TidStore *ts)
{
- uint64 num_tids;
+ int64 num_tids;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -575,16 +591,16 @@ tidstore_num_tids(TidStore *ts)
/* Return true if the current memory usage of TidStore exceeds the limit */
bool
-tidstore_is_full(TidStore *ts)
+TidStoreIsFull(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+ return (TidStoreMemoryUsage(ts) > ts->control->max_bytes);
}
/* Return the maximum memory TidStore can use */
size_t
-tidstore_max_memory(TidStore *ts)
+TidStoreMaxMemory(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -593,7 +609,7 @@ tidstore_max_memory(TidStore *ts)
/* Return the memory usage of TidStore */
size_t
-tidstore_memory_usage(TidStore *ts)
+TidStoreMemoryUsage(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -611,71 +627,75 @@ tidstore_memory_usage(TidStore *ts)
/*
* Get a handle that can be used by other processes to attach to this TidStore
*/
-tidstore_handle
-tidstore_get_handle(TidStore *ts)
+TidStoreHandle
+TidStoreGetHandle(TidStore *ts)
{
Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
return ts->control->handle;
}
-/* Extract tids from the given key-value pair */
+/*
+ * Decode the key and offset bitmap to tids and store them to the iteration
+ * result.
+ */
static void
-tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap)
{
- TidStoreIterResult *result = (&iter->result);
+ TidStoreIterResult *output = (&iter->output);
- while (val)
+ while (off_bitmap)
{
- uint64 tid_i;
+ uint64 compressed_tid;
OffsetNumber off;
- tid_i = key << TIDSTORE_VALUE_NBITS;
- tid_i |= pg_rightmost_one_pos64(val);
+ compressed_tid = key << LOWER_OFFSET_NBITS;
+ compressed_tid |= pg_rightmost_one_pos64(off_bitmap);
- off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+ off = compressed_tid & ((UINT64CONST(1) << iter->ts->control->max_off_nbits) - 1);
- Assert(result->num_offsets < iter->ts->control->max_offset);
- result->offsets[result->num_offsets++] = off;
+ Assert(output->num_offsets < iter->ts->control->max_off);
+ output->offsets[output->num_offsets++] = off;
/* unset the rightmost bit */
- val &= ~pg_rightmost_one64(val);
+ off_bitmap &= ~pg_rightmost_one64(off_bitmap);
}
- result->blkno = key_get_blkno(iter->ts, key);
+ output->blkno = key_get_blkno(iter->ts, key);
}
/* Get block number from the given key */
static inline BlockNumber
-key_get_blkno(TidStore *ts, uint64 key)
+key_get_blkno(TidStore *ts, tidkey key)
{
- return (BlockNumber) (key >> ts->control->offset_key_nbits);
+ return (BlockNumber) (key >> ts->control->upper_off_nbits);
}
-/* Encode a tid to key and offset */
-static inline uint64
-tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
+/* Encode a tid to key and partial offset */
+static inline tidkey
+encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit)
{
- uint32 offset = ItemPointerGetOffsetNumber(tid);
+ OffsetNumber offset = ItemPointerGetOffsetNumber(tid);
BlockNumber block = ItemPointerGetBlockNumber(tid);
- return encode_key_off(ts, block, offset, off_bit);
+ return encode_blk_off(ts, block, offset, off_bit);
}
/* encode a block and offset to a key and partial offset */
-static inline uint64
-encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
+static inline tidkey
+encode_blk_off(TidStore *ts, BlockNumber block, OffsetNumber offset,
+ offsetbm *off_bit)
{
- uint64 key;
- uint64 tid_i;
+ tidkey key;
+ uint64 compressed_tid;
uint32 off_lower;
- off_lower = offset & TIDSTORE_OFFSET_MASK;
- Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
+ off_lower = offset & LOWER_OFFSET_MASK;
+ Assert(off_lower < (sizeof(offsetbm) * BITS_PER_BYTE));
*off_bit = UINT64CONST(1) << off_lower;
- tid_i = offset | ((uint64) block << ts->control->offset_nbits);
- key = tid_i >> TIDSTORE_VALUE_NBITS;
+ compressed_tid = offset | ((uint64) block << ts->control->max_off_nbits);
+ key = compressed_tid >> LOWER_OFFSET_NBITS;
return key;
}
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index a35a52124a..66f0fdd482 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -17,33 +17,34 @@
#include "storage/itemptr.h"
#include "utils/dsa.h"
-typedef dsa_pointer tidstore_handle;
+typedef dsa_pointer TidStoreHandle;
typedef struct TidStore TidStore;
typedef struct TidStoreIter TidStoreIter;
+/* Result struct for TidStoreIterateNext */
typedef struct TidStoreIterResult
{
BlockNumber blkno;
- OffsetNumber *offsets;
int num_offsets;
+ OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
} TidStoreIterResult;
-extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
-extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
-extern void tidstore_detach(TidStore *ts);
-extern void tidstore_destroy(TidStore *ts);
-extern void tidstore_reset(TidStore *ts);
-extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
- int num_offsets);
-extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
-extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
-extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
-extern void tidstore_end_iterate(TidStoreIter *iter);
-extern int64 tidstore_num_tids(TidStore *ts);
-extern bool tidstore_is_full(TidStore *ts);
-extern size_t tidstore_max_memory(TidStore *ts);
-extern size_t tidstore_memory_usage(TidStore *ts);
-extern tidstore_handle tidstore_get_handle(TidStore *ts);
+extern TidStore *TidStoreCreate(size_t max_bytes, int max_off, dsa_area *dsa);
+extern TidStore *TidStoreAttach(dsa_area *dsa, dsa_pointer handle);
+extern void TidStoreDetach(TidStore *ts);
+extern void TidStoreDestroy(TidStore *ts);
+extern void TidStoreReset(TidStore *ts);
+extern void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool TidStoreIsMember(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * TidStoreBeginIterate(TidStore *ts);
+extern TidStoreIterResult *TidStoreIterateNext(TidStoreIter *iter);
+extern void TidStoreEndIterate(TidStoreIter *iter);
+extern int64 TidStoreNumTids(TidStore *ts);
+extern bool TidStoreIsFull(TidStore *ts);
+extern size_t TidStoreMaxMemory(TidStore *ts);
+extern size_t TidStoreMemoryUsage(TidStore *ts);
+extern TidStoreHandle TidStoreGetHandle(TidStore *ts);
#endif /* TIDSTORE_H */
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
index 9a1217f833..8659e6780e 100644
--- a/src/test/modules/test_tidstore/test_tidstore.c
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -37,10 +37,10 @@ check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
ItemPointerSet(&tid, blkno, off);
- found = tidstore_lookup_tid(ts, &tid);
+ found = TidStoreIsMember(ts, &tid);
if (found != expect)
- elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ elog(ERROR, "TidStoreIsMember for TID (%u, %u) returned %d, expected %d",
blkno, off, found, expect);
}
@@ -69,9 +69,9 @@ test_basic(int max_offset)
LWLockRegisterTranche(tranche_id, "test_tidstore");
dsa = dsa_create(tranche_id);
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
#else
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
#endif
/* prepare the offset array */
@@ -83,7 +83,7 @@ test_basic(int max_offset)
/* add tids */
for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
- tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+ TidStoreSetBlockOffsets(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
/* lookup test */
for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
@@ -105,30 +105,30 @@ test_basic(int max_offset)
}
/* test the number of tids */
- if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
- elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
- tidstore_num_tids(ts),
+ if (TidStoreNumTids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "TidStoreNumTids returned " UINT64_FORMAT ", expected %d",
+ TidStoreNumTids(ts),
TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
/* iteration test */
- iter = tidstore_begin_iterate(ts);
+ iter = TidStoreBeginIterate(ts);
blk_idx = 0;
- while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
/* check the returned block number */
if (blks_sorted[blk_idx] != iter_result->blkno)
- elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ elog(ERROR, "TidStoreIterateNext returned block number %u, expected %u",
iter_result->blkno, blks_sorted[blk_idx]);
/* check the returned offset numbers */
if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
- elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ elog(ERROR, "TidStoreIterateNext %u offsets, expected %u",
iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
for (int i = 0; i < iter_result->num_offsets; i++)
{
if (offs[i] != iter_result->offsets[i])
- elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ elog(ERROR, "TidStoreIterateNext offset number %u on block %u, expected %u",
iter_result->offsets[i], iter_result->blkno, offs[i]);
}
@@ -136,15 +136,15 @@ test_basic(int max_offset)
}
if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
- elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ elog(ERROR, "TidStoreIterateNext returned %d blocks, expected %d",
blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
/* remove all tids */
- tidstore_reset(ts);
+ TidStoreReset(ts);
/* test the number of tids */
- if (tidstore_num_tids(ts) != 0)
- elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
/* lookup test for empty store */
for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
@@ -156,7 +156,7 @@ test_basic(int max_offset)
check_tid(ts, MaxBlockNumber, off, false);
}
- tidstore_destroy(ts);
+ TidStoreDestroy(ts);
#ifdef TEST_SHARED_TIDSTORE
dsa_detach(dsa);
@@ -177,36 +177,37 @@ test_empty(void)
LWLockRegisterTranche(tranche_id, "test_tidstore");
dsa = dsa_create(tranche_id);
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
#else
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
#endif
elog(NOTICE, "testing empty tidstore");
ItemPointerSet(&tid, 0, FirstOffsetNumber);
- if (tidstore_lookup_tid(ts, &tid))
- elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
+ 0, FirstOffsetNumber);
ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
- if (tidstore_lookup_tid(ts, &tid))
- elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
MaxBlockNumber, MaxOffsetNumber);
- if (tidstore_num_tids(ts) != 0)
- elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
- if (tidstore_is_full(ts))
- elog(ERROR, "tidstore_is_full on empty store returned true");
+ if (TidStoreIsFull(ts))
+ elog(ERROR, "TidStoreIsFull on empty store returned true");
- iter = tidstore_begin_iterate(ts);
+ iter = TidStoreBeginIterate(ts);
- if (tidstore_iterate_next(iter) != NULL)
- elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+ if (TidStoreIterateNext(iter) != NULL)
+ elog(ERROR, "TidStoreIterateNext on empty store returned TIDs");
- tidstore_end_iterate(iter);
+ TidStoreEndIterate(iter);
- tidstore_destroy(ts);
+ TidStoreDestroy(ts);
#ifdef TEST_SHARED_TIDSTORE
dsa_detach(dsa);
@@ -221,6 +222,7 @@ test_tidstore(PG_FUNCTION_ARGS)
elog(NOTICE, "testing basic operations");
test_basic(MaxHeapTuplesPerPage);
test_basic(10);
+ test_basic(MaxHeapTuplesPerPage * 2);
PG_RETURN_VOID();
}
--
2.31.1
Attachment: v31-0003-Add-radixtree-template.patch (application/octet-stream)
From 014d2f9a13af4e9f57ff2f8e44fba61c71ecec66 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v31 03/14] Add radixtree template
WIP: commit message based on template comments
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2516 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 328 +++
src/include/lib/radixtree_iter_impl.h | 153 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 681 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4089 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..e546bd705c
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2516 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different number of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves".
+ *
+ * For simplicity, the key is assumed to be 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined if RT_USE_DELETE is defined
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Tree level the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256 is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+ * Number of children. We use uint16 to be able to indicate 256 children
+ * at the fanout of 8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: Inner tree nodes (shift > 0) store the
+ * pointer to a child node in the slot. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base type of each node kind for leaf and inner nodes.
+ * The base types must be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses slot_idx array, an array of RT_NODE_MAX_SLOTS length
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+ /* The index of slots for each fanout */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different than
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate nodes at each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to prevent other processes from beginning an
+ * iteration while one is in progress, or to allow multiple processes to iterate.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
+
+
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=: since min(chunk, x) equals chunk exactly when x >= chunk, comparing
+ * the broadcast chunk against the element-wise minimum finds the elements
+ * that are >= chunk. There'll never be any equal elements in current uses,
+ * but that's what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements at and after 'idx' one position to the right */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
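+
+/*
+ * In node-125, the chunk byte indexes slot_idxs[], which in turn gives the
+ * position in the children/values array; the isset bitmap tracks which of
+ * those slots are currently in use.
+ */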
+
+/* Is the slot corresponding to the given chunk in use? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Delete the child or value at the given chunk position in the node-256 */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that allows storing the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
+
+/*
+ * Allocate a new node of the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static pg_noinline void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
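+/* Copy the common fields (shift and count) from the old node to the new one */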
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static inline void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static pg_noinline void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, true);
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, true);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't yet have the inner and leaf nodes for the given
+ * key-value pair. Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static pg_noinline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * child pointer is stored in *child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is copied to *value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static void
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
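+/*
+ * Example of basic local-memory usage (an illustrative sketch only):
+ *
+ *		RT_RADIX_TREE *tree = RT_CREATE(CurrentMemoryContext);
+ *		RT_VALUE_TYPE value = ...;
+ *
+ *		RT_SET(tree, key, &value);
+ *		if (RT_SEARCH(tree, key, &value))
+ *			... the key exists and 'value' now holds its value ...
+ *		RT_FREE(tree);
+ *
+ * The shared-memory variant additionally takes a dsa_area and a tranche id
+ * in RT_CREATE, and other backends attach with RT_ATTACH.
+ */
+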
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated in the DSA area.
+ */
+static void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+ * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set the key to the value pointed to by 'value_p'. If the entry already
+ * exists, update its value and return true; return false if the entry
+ * didn't yet exist.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is
+ * found, otherwise return false. On success, the value is copied to
+ * *value_p, so value_p must not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree to search for the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* the key was not found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * Return if the leaf node still has keys, in which case we don't need to
+ * delete the node.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
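+/*
+ * Replace the chunk of the key under construction at the given shift with
+ * 'chunk', leaving the other bits untouched.
+ */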
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance the slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance the slot in the leaf node. On success, return true and store the
+ * value in *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+ /* empty tree */
+ if (!RT_PTR_ALLOC_IS_VALID(iter->tree->ctl->root))
+ {
+ MemoryContextSwitchTo(old_ctx);
+ return iter;
+ }
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend to the leftmost leaf node from the root. The key is constructed
+ * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key; otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance the inner node
+ * iterators from level 1 upward until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Found the next child node. Update the iterator stack from this node
+ * down to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function needs to be called after finishing the iteration, or when
+ * exiting it early.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..d56e58dcac
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,328 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ bool chunk_exists = false;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
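+ /*
+ * Control flow note: when the target node is full, the case below grows
+ * it into a larger node (a bigger size class of the same kind, or the
+ * next kind) and, when the kind changes, falls through to the next case
+ * to insert into the new node. Otherwise the insertion happens in place
+ * and we break out of the switch.
+ */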
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n3->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos;
+ int cnt = 0;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n125->values[slotpos] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!chunk_exists)
+ node->count++;
+#else
+ node->count++;
+#endif
+
+ /*
+ * Done. Finally, verify that the chunk and its value or child were
+ * inserted or replaced properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return chunk_exists;
+#else
+ return;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..98c78eb237
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,153 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
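+ *
+ *	  For inner nodes this fragment returns the next child pointer, or NULL
+ *	  when the node is exhausted; for leaf nodes it returns whether a next
+ *	  value was found, storing that value in *value_p.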
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
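(A note on how these fragments are meant to be consumed: radixtree_iter_impl.h
and radixtree_search_impl.h deliberately have no include guards because
radixtree.h is expected to textually include each of them twice, once with
RT_NODE_LEVEL_LEAF defined and once with RT_NODE_LEVEL_INNER, so that a single
switch over the node kinds serves both the leaf and inner variants. A minimal
sketch of that pattern, with hypothetical wrapper names and an assumed
child-pointer type, not taken verbatim from the patch:

    /* illustrative only; wrapper names and exact signatures are assumptions */
    static inline bool
    RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
    {
    #define RT_NODE_LEVEL_LEAF
    #include "lib/radixtree_search_impl.h"
    #undef RT_NODE_LEVEL_LEAF
    }

    static inline bool
    RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
    {
    #define RT_NODE_LEVEL_INNER
    #include "lib/radixtree_search_impl.h"
    #undef RT_NODE_LEVEL_INNER
    }

Both flavors return bool; only the out parameter differs, value_p for leaf
nodes and child_p for inner nodes.)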
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..9659eb85d7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..232cbdac80 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -24,6 +24,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/include/lib/radixtree.h"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..afe53382f3
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,681 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+/* #define RT_SHMEM */
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree returned non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_iterate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in order like 1, 32, 2, 31, 3, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check if keys from start to end with the shift exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType*) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(0));
+}
+
+/*
+ * Test for inserting and deleting key-value pairs to each node type at the given shift
+ * level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", val, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 1; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT " after " UINT64_FORMAT " deletions",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index b0e9aa99a2..2f72d5ed4b 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index 8dee1b5670..133313255c 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.31.1
v31-0006-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patchapplication/octet-stream; name=v31-0006-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patchDownload
From b3ac3b456aa1448f3e959674f16bed18630266be Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 7 Feb 2023 17:19:29 +0700
Subject: [PATCH v31 06/14] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space efficient and was slow to look up. Its size
was also limited to 1GB.
Now we use TIDStore to store dead tuple TIDs. Since the TIDStore,
backed by the radix tree, allocates memory incrementally, we get rid
of the 1GB limit.
Since we can no longer estimate the exact maximum number of TIDs that
can be stored, pg_stat_progress_vacuum now reports the progress
information based on the amount of memory in bytes. The column names
are also changed to max_dead_tuple_bytes and num_dead_tuple_bytes.
In addition, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by the TIDStore is 1MB, the initial
DSA segment size. Due to that, we increase the minimum value of
maintenance_work_mem (and autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
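As a rough illustration of the new flow, the first heap pass now drives the
store along these lines (a sketch only, pieced together from the tidstore_*
calls this patch uses; the loop scaffolding and declarations are simplified,
not copied from vacuumlazy.c):

    /* sketch: not the patch itself */
    TidStore   *dead_items = tidstore_create(vac_work_mem,
                                             MaxHeapTuplesPerPage, NULL);

    for (BlockNumber blkno = 0; blkno < rel_pages; blkno++)
    {
        /* if the memory budget is used up, vacuum indexes and heap first;
         * lazy_vacuum() also resets the store when it is done */
        if (tidstore_is_full(dead_items))
            lazy_vacuum(vacrel);

        /* prune the page, collecting LP_DEAD offsets into deadoffsets[] */
        tidstore_add_tids(dead_items, blkno, deadoffsets, num_offsets);
        pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
                                     tidstore_memory_usage(dead_items));
    }

    /* final index/heap vacuum cycle for whatever remains */
    if (tidstore_num_tids(dead_items) > 0)
        lazy_vacuum(vacrel);

The only size question VACUUM has to ask is whether the memory budget is
exhausted; the store itself grows incrementally up to that budget.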
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 278 ++++++++-------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-----
src/backend/commands/vacuumparallel.c | 73 +++---
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 177 insertions(+), 314 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 97d588b1d8..47b346d36c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7170,10 +7170,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -7181,10 +7181,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..b4e40423a8 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3,18 +3,18 @@
* vacuumlazy.c
* Concurrent ("lazy") vacuuming.
*
- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is the TidStore, which stores the dead TIDs
* that are to be removed from indexes. We want to ensure we can vacuum even
* the very largest relations with finite memory space usage. To do that, we
- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables). If the array threatens to overflow, we must call
- * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
- * This frees up the memory space dedicated to storing dead TIDs.
+ * create a TidStore, specifying the maximum number of bytes it may use.
+ * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
+ * vacuum the pages that we've pruned). This frees up the memory space dedicated
+ * to storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -188,7 +189,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -220,11 +221,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -259,8 +263,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -487,11 +492,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/*
- * Allocate dead_items array memory using dead_items_alloc. This handles
- * parallel VACUUM initialization as part of allocating shared memory
- * space used for dead_items. (But do a failsafe precheck first, to
- * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
- * is already dangerously old.)
+ * Allocate dead_items memory using dead_items_alloc. This handles parallel
+ * VACUUM initialization as part of allocating shared memory space used for
+ * dead_items. (But do a failsafe precheck first, to ensure that parallel
+ * VACUUM won't be attempted at all when relfrozenxid is already dangerously
+ * old.)
*/
lazy_check_wraparound_failsafe(vacrel);
dead_items_alloc(vacrel, params->nworkers);
@@ -797,7 +802,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* have collected the TIDs whose index tuples need to be removed.
*
* Finally, invokes lazy_vacuum_heap_rel to vacuum heap pages, which
- * largely consists of marking LP_DEAD items (from collected TID array)
+ * largely consists of marking LP_DEAD items (from vacrel->dead_items)
* as LP_UNUSED. This has to happen in a second, final pass over the
* heap, to preserve a basic invariant that all index AMs rely on: no
* extant index tuple can ever be allowed to contain a TID that points to
@@ -825,21 +830,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -906,8 +911,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -969,7 +973,7 @@ lazy_scan_heap(LVRelState *vacrel)
continue;
}
- /* Collect LP_DEAD items in dead_items array, count tuples */
+ /* Collect LP_DEAD items in dead_items, count tuples */
if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
&recordfreespace))
{
@@ -1011,14 +1015,14 @@ lazy_scan_heap(LVRelState *vacrel)
* Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
- * dead_items array. This includes LP_DEAD line pointers that we
- * pruned ourselves, as well as existing LP_DEAD line pointers that
- * were pruned some time earlier. Also considers freezing XIDs in the
- * tuple headers of remaining items with storage.
+ * dead_items. This includes LP_DEAD line pointers that we pruned
+ * ourselves, as well as existing LP_DEAD line pointers that were pruned
+ * some time earlier. Also considers freezing XIDs in the tuple headers
+ * of remaining items with storage.
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1034,14 +1038,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1078,7 +1080,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page in dead_items */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1145,7 +1156,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1193,7 +1204,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1249,7 +1260,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1524,9 +1535,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
* The approach we take now is to restart pruning when the race condition is
* detected. This allows heap_page_prune() to prune the tuples inserted by
* the now-aborted transaction. This is a little crude, but it guarantees
- * that any items that make it into the dead_items array are simple LP_DEAD
- * line pointers, and that every remaining item with tuple storage is
- * considered as a candidate for freezing.
+ * that any items that make it into the dead_items are simple LP_DEAD line
+ * pointers, and that every remaining item with tuple storage is considered
+ * as a candidate for freezing.
*/
static void
lazy_scan_prune(LVRelState *vacrel,
@@ -1543,13 +1554,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1571,7 +1580,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1580,9 +1588,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->num_offsets; num_offsets's final value can be thought
+ * of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1593,7 +1601,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1638,7 +1646,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1875,7 +1883,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1888,28 +1896,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1928,7 +1917,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -1940,7 +1929,7 @@ retry:
* lazy_scan_prune, which requires a full cleanup lock. While pruning isn't
* performed here, it's quite possible that an earlier opportunistic pruning
* operation left LP_DEAD items behind. We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items for removal from indexes.
*
* For aggressive VACUUM callers, we may return false to indicate that a full
* cleanup lock is required for processing by lazy_scan_prune. This is only
@@ -2099,7 +2088,7 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
vacrel->NewRelminMxid = NoFreezePageRelminMxid;
- /* Save any LP_DEAD items found on the page in dead_items array */
+ /* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
@@ -2129,8 +2118,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2139,17 +2127,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2198,7 +2179,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2227,7 +2208,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2254,8 +2235,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2300,7 +2281,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2373,7 +2354,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || vacrel->failsafe_active);
/*
@@ -2392,9 +2373,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
/*
* lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
*
- * This routine marks LP_DEAD items in vacrel->dead_items array as LP_UNUSED.
- * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
- * at all.
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
*
* We may also be able to truncate the line pointer array of the heap pages we
* visit. If there is a contiguous group of LP_UNUSED items at the end of the
@@ -2410,10 +2390,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2428,7 +2409,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2437,7 +2419,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2451,7 +2433,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2461,6 +2444,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2470,36 +2454,31 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT " dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2518,16 +2497,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2597,7 +2571,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -2687,8 +2660,8 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
* lazy_vacuum_one_index() -- vacuum index relation.
*
* Delete all the index tuples containing a TID collected in
- * vacrel->dead_items array. Also update running statistics.
- * Exact details depend on index AM's ambulkdelete routine.
+ * vacrel->dead_items. Also update running statistics. Exact
+ * details depend on index AM's ambulkdelete routine.
*
* reltuples is the number of heap tuples to be passed to the
* bulkdelete callback. It's always assumed to be estimated.
@@ -3094,48 +3067,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
}
/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
-/*
- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
@@ -3143,11 +3076,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3174,7 +3105,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3187,11 +3118,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..149d41b41c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,7 +1180,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 2e12baf8eb..785b825bbc 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -97,7 +97,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* Primary entry point for manual VACUUM and ANALYZE commands
@@ -2327,16 +2326,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2367,82 +2366,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch(itemptr,
- dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index bcd40c80a1..d653683693 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,12 +9,11 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSM segment. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * the shared TidStore. We launch parallel worker processes at the start of
+ * parallel index bulk-deletion and index cleanup and once all indexes are
+ * processed, the parallel worker processes exit. Each time we process indexes
+ * in parallel, the parallel context is re-initialized so that the same DSM can
+ * be used for multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -103,6 +102,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -166,7 +168,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -222,20 +225,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -283,9 +289,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -351,6 +356,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -360,6 +375,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
pg_atomic_init_u32(&(shared->cost_balance), 0);
pg_atomic_init_u32(&(shared->active_nworkers), 0);
@@ -368,15 +384,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -434,6 +441,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -442,7 +452,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -940,7 +950,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -984,10 +996,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumCostActive = (VacuumCostDelay > 0);
@@ -1033,6 +1045,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index c0e2e00a7e..60caeae739 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3399,12 +3399,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 1c0583fe26..8a64614cd1 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2313,7 +2313,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index bdfd96cfec..cec2d1d356 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -277,21 +278,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -340,18 +326,17 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index acfd9d1f4f..d320ad87dd 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e953d1f515..ef46c2994f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2032,8 +2032,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index d49ce9f300..d6e2471b00 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
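
For readers skimming the patch above, here is a minimal sketch (not part of the patch series; vacrel, blkno, deadoffsets, itemptr, buf, vmbuffer and vac_work_mem stand in for the surrounding vacuum state) of how lazy vacuum drives the new TidStore API:

    /* first heap pass: remember the LP_DEAD offsets collected from one block */
    TidStore   *dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage, NULL);

    tidstore_add_tids(dead_items, blkno, deadoffsets, num_deadoffsets);

    /* index vacuuming: the ambulkdelete callback asks whether a TID is dead */
    deletable = tidstore_lookup_tid(dead_items, itemptr);

    /* second heap pass: walk the store in block number order */
    TidStoreIter *iter = tidstore_begin_iterate(dead_items);
    TidStoreIterResult *result;

    while ((result = tidstore_iterate_next(iter)) != NULL)
        lazy_vacuum_heap_page(vacrel, result->blkno, result->offsets,
                              result->num_offsets, buf, vmbuffer);
    tidstore_end_iterate(iter);

    tidstore_destroy(dead_items);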
v31-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 46ccfc2d0b588e090d1f46bc16f463789227aff4 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v31 02/14] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 34 +-------------------------------
src/include/nodes/bitmapset.h | 16 +++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 46 insertions(+), 36 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 7ba3cf635b..0b2962ed73 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -30,39 +30,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
static bool bms_is_empty_internal(const Bitmapset *a);
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 14de6a9ff1..c7e1711147 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -36,13 +36,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -73,6 +71,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 158ef73a2b..bf7588e075 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -32,6 +32,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 86a9303bf5..4a5e776703 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3675,7 +3675,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
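
To make the relocated bit trick concrete, here is a small standalone illustration (plain stdint types rather than the pg_bitutils versions) that isolates the rightmost one bit the same way pg_rightmost_one32 does, and applies the HAS_MULTIPLE_ONES test:

    #include <stdint.h>
    #include <stdio.h>

    /* same idea as pg_rightmost_one32: invert, add one, AND with the original */
    static uint32_t
    rightmost_one32(uint32_t word)
    {
        return word & (~word + 1);
    }

    int
    main(void)
    {
        uint32_t    x = 0xb0;	/* binary 10110000 */

        printf("0x%x\n", (unsigned) rightmost_one32(x));	/* prints 0x10 */
        printf("%d\n", rightmost_one32(x) != x);			/* HAS_MULTIPLE_ONES: 1 */
        return 0;
    }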
v31-0012-Revert-the-update-for-the-minimum-value-of-maint.patch
From 8080e74de8597b6e8567fbfce5dbd2771937287c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Mar 2023 15:09:22 +0900
Subject: [PATCH v31 12/14] Revert the update for the minimum value of
maintenance_work_mem.
---
src/backend/postmaster/autovacuum.c | 6 +++---
src/backend/utils/misc/guc_tables.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 60caeae739..c0e2e00a7e 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3399,12 +3399,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 2MB. Since
+ * We clamp manually-set values to at least 1MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 2048)
- *newval = 2048;
+ if (*newval < 1024)
+ *newval = 1024;
return true;
}
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 8a64614cd1..1c0583fe26 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2313,7 +2313,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 2048, MAX_KILOBYTES,
+ 65536, 1024, MAX_KILOBYTES,
NULL, NULL, NULL
},
--
2.31.1
v31-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch
From 2176fc0e5b4bee9e389f8a29637ef9ed29aec0da Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v31 01/14] Introduce helper SIMD functions for small byte
arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 1fa6c3bc6c..dfae14e463 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+ * Note: There is a faster way to do this, but it returns a uint64, and
+ * if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
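
As a rough illustration of why these two helpers are useful together, here is a standalone SSE2 sketch using the raw intrinsics (not the simd.h wrappers themselves; the NEON path would be analogous). SSE2 has no unsigned byte "<=" comparison, but min-then-compare-equal emulates it, and the movemask that backs vector8_highbit_mask turns the per-lane result into an ordinary bitmask:

    #include <emmintrin.h>	/* SSE2 */
    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        uint8_t     chunk[16] = {3, 200, 7, 9, 0, 50, 50, 255,
                                 1, 2, 3, 4, 5, 6, 7, 8};
        __m128i     v = _mm_loadu_si128((const __m128i *) chunk);
        __m128i     c = _mm_set1_epi8((char) 50);

        /* chunk[i] <= 50  iff  min(chunk[i], 50) == chunk[i] */
        __m128i     le = _mm_cmpeq_epi8(_mm_min_epu8(v, c), v);

        /* one bit per byte lane, like vector8_highbit_mask() */
        uint32_t    mask = (uint32_t) _mm_movemask_epi8(le);

        printf("lanes <= 50: 0x%04x\n", (unsigned) mask);	/* 0xff7d */
        return 0;
    }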
v31-0010-Radix-tree-optionally-tracks-memory-usage-when-R.patch
From 7da5e7808ba51aed7ad22b9758b3200cbfcd7d19 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 8 Mar 2023 15:08:19 +0900
Subject: [PATCH v31 10/14] Radix tree optionally tracks memory usage, when
RT_MEASURE_MEMORY_USAGE.
---
contrib/bench_radix_tree/bench_radix_tree.c | 1 +
src/backend/utils/mmgr/dsa.c | 12 ---
src/include/lib/radixtree.h | 93 +++++++++++++++++--
src/include/utils/dsa.h | 1 -
.../modules/test_radixtree/test_radixtree.c | 1 +
5 files changed, 85 insertions(+), 23 deletions(-)
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 6e5149e2c4..8a0c754a2c 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -34,6 +34,7 @@ PG_MODULE_MAGIC;
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE uint64
// WIP: compiles with warnings because rt_attach is defined but not used
// #define RT_SHMEM
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 80555aefff..f5a62061a3 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,18 +1024,6 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
-size_t
-dsa_get_total_size(dsa_area *area)
-{
- size_t size;
-
- LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
- size = area->control->total_segment_size;
- LWLockRelease(DSA_AREA_LOCK(area));
-
- return size;
-}
-
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 8bea606c62..f7812eb12a 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -84,7 +84,6 @@
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
* RT_ITERATE_NEXT - Return next key-value pair, if any
* RT_END_ITERATE - End iteration
- * RT_MEMORY_USAGE - Get the memory usage
*
* Interface for Shared Memory
* ---------
@@ -97,6 +96,8 @@
* ---------
*
* RT_DELETE - Delete a key-value pair. Declared/define if RT_USE_DELETE is defined
+ * RT_MEMORY_USAGE - Get the memory usage. Declared/defined if
+ * RT_MEASURE_MEMORY_USAGE is defined.
*
*
* Copyright (c) 2023, PostgreSQL Global Development Group
@@ -138,7 +139,9 @@
#ifdef RT_USE_DELETE
#define RT_DELETE RT_MAKE_NAME(delete)
#endif
+#ifdef RT_MEASURE_MEMORY_USAGE
#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#endif
#ifdef RT_DEBUG
#define RT_DUMP RT_MAKE_NAME(dump)
#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
@@ -150,6 +153,9 @@
#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#ifdef RT_MEASURE_MEMORY_USAGE
+#define RT_FANOUT_GET_NODE_SIZE RT_MAKE_NAME(fanout_get_node_size)
+#endif
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
#define RT_EXTEND_UP RT_MAKE_NAME(extend_up)
@@ -255,7 +261,9 @@ RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+#ifdef RT_MEASURE_MEMORY_USAGE
RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+#endif
#ifdef RT_DEBUG
RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
@@ -624,6 +632,10 @@ typedef struct RT_RADIX_TREE_CONTROL
uint64 max_val;
uint64 num_keys;
+#ifdef RT_MEASURE_MEMORY_USAGE
+ int64 mem_used;
+#endif
+
/* statistics */
#ifdef RT_DEBUG
int32 cnt[RT_SIZE_CLASS_COUNT];
@@ -1089,6 +1101,11 @@ RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
allocsize);
#endif
+#ifdef RT_MEASURE_MEMORY_USAGE
+ /* update memory usage */
+ tree->ctl->mem_used += allocsize;
+#endif
+
#ifdef RT_DEBUG
/* update the statistics */
tree->ctl->cnt[size_class]++;
@@ -1165,6 +1182,54 @@ RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL no
return newnode;
}
+#ifdef RT_MEASURE_MEMORY_USAGE
+/* Return the node size of the given fanout of the size class */
+static inline Size
+RT_FANOUT_GET_NODE_SIZE(int fanout, bool is_leaf)
+{
+ const Size fanout_inner_node_size[] = {
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].inner_size,
+ [15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].inner_size,
+ [32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].inner_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].inner_size,
+ [256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].inner_size,
+ };
+ const Size fanout_leaf_node_size[] = {
+ [3] = RT_SIZE_CLASS_INFO[RT_CLASS_3].leaf_size,
+ [15] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN].leaf_size,
+ [32] = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX].leaf_size,
+ [125] = RT_SIZE_CLASS_INFO[RT_CLASS_125].leaf_size,
+ [256] = RT_SIZE_CLASS_INFO[RT_CLASS_256].leaf_size,
+ };
+ Size node_size;
+
+ node_size = is_leaf ?
+ fanout_leaf_node_size[fanout] : fanout_inner_node_size[fanout];
+
+#ifdef USE_ASSERT_CHECKING
+ {
+ Size assert_node_size = 0;
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+
+ if (size_class.fanout == fanout)
+ {
+ assert_node_size = is_leaf ?
+ size_class.leaf_size : size_class.inner_size;
+ break;
+ }
+ }
+
+ Assert(node_size == assert_node_size);
+ }
+#endif
+
+ return node_size;
+}
+#endif
+
/* Free the given node */
static void
RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
@@ -1197,11 +1262,22 @@ RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
}
#endif
+#ifdef RT_MEASURE_MEMORY_USAGE
+ /* update memory usage */
+ {
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ tree->ctl->mem_used -= RT_FANOUT_GET_NODE_SIZE(node->fanout,
+ RT_NODE_IS_LEAF(node));
+ Assert(tree->ctl->mem_used >= 0);
+ }
+#endif
+
#ifdef RT_SHMEM
dsa_free(tree->dsa, allocnode);
#else
pfree(allocnode);
#endif
+
}
/* Update the parent's pointer when growing a node */
@@ -1989,27 +2065,23 @@ RT_END_ITERATE(RT_ITER *iter)
/*
* Return the statistics of the amount of memory used by the radix tree.
*/
+#ifdef RT_MEASURE_MEMORY_USAGE
RT_SCOPE uint64
RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
{
Size total = 0;
- RT_LOCK_SHARED(tree);
-
#ifdef RT_SHMEM
Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
- total = dsa_get_total_size(tree->dsa);
-#else
- for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
- {
- total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
- total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
- }
#endif
+ RT_LOCK_SHARED(tree);
+ total = tree->ctl->mem_used;
RT_UNLOCK(tree);
+
return total;
}
+#endif
/*
* Verify the radix tree node.
@@ -2476,6 +2548,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_NEW_ROOT
#undef RT_ALLOC_NODE
#undef RT_INIT_NODE
+#undef RT_FANOUT_GET_NODE_SIZE
#undef RT_FREE_NODE
#undef RT_FREE_RECURSE
#undef RT_EXTEND_UP
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 2af215484f..3ce4ee300a 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,7 +121,6 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
-extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index 5a169854d9..19d286d84b 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -114,6 +114,7 @@ static const test_spec test_specs[] = {
#define RT_DECLARE
#define RT_DEFINE
#define RT_USE_DELETE
+#define RT_MEASURE_MEMORY_USAGE
#define RT_VALUE_TYPE TestValueType
/* #define RT_SHMEM */
#include "lib/radixtree.h"
--
2.31.1
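
The memory accounting added here is opt-in at template-instantiation time. A minimal sketch of a caller (mirroring the defines used by bench_radix_tree.c and test_radixtree.c; 'mytree' and max_bytes are placeholders) looks like this:

    #define RT_PREFIX mytree
    #define RT_SCOPE static
    #define RT_DECLARE
    #define RT_DEFINE
    #define RT_MEASURE_MEMORY_USAGE		/* enables mytree_memory_usage() */
    #define RT_VALUE_TYPE uint64
    #include "lib/radixtree.h"

    mytree_radix_tree *tree = mytree_create(CurrentMemoryContext);

    /* stop collecting once the tracked allocations exceed the budget */
    if (mytree_memory_usage(tree) > max_bytes)
        /* e.g. trigger index vacuuming and reset the store */ ;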
v31-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
From db646cb1da4a21182028096e036b0f86d61e8ce8 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v31 04/14] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and a
64-bit value, and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 681 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 226 ++++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1057 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d0..97d588b1d8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2203,6 +2203,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..8c05e60d92
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,681 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * with tidstore_attach().
+ *
+ * As for concurrency, we basically rely on the concurrency support in the
+ * radix tree, but we acquire the lock on a TidStore in some cases, for
+ * example, when resetting the store and when accessing the number of tids
+ * in the store (num_tids).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of 64-bit key and
+ * 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is determined by max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys could help keep the radix
+ * tree shallow.
+ *
+ * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
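+ *
+ * As a worked example (illustrative numbers, not part of the scheme itself):
+ * the tid (block 1000, offset 5) becomes the integer (1000 << 9) | 5 = 512005,
+ * so the key is 512005 >> 6 = 8000 and the value has bit 512005 % 64 = 5 set.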
+ */
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for an offset
+ * number */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used in a key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+ /* have we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+ * Memory consumption depends on the number of stored tids, but also on their
+ * distribution, on how the radix tree stores them, and on the memory
+ * management that backs the radix tree. The maximum number of bytes that a
+ * TidStore can use is specified by max_bytes in tidstore_create(). We want
+ * the total memory consumption of a TidStore not to exceed max_bytes.
+ *
+ * In the local TidStore case, the radix tree uses a slab allocator for each
+ * node class. The most memory-consuming case while adding tids associated
+ * with one page (i.e., during tidstore_add_tids()) is allocating a new slab
+ * block for a new radix tree node, which is approximately 70kB. Therefore,
+ * we deduct 70kB from max_bytes.
+ *
+ * In the shared case, DSA allocates memory segments that follow a geometric
+ * series, approximately doubling the total DSA size (see make_new_segment()
+ * in dsa.c). We simulated how DSA grows the segment size, and the simulation
+ * showed that a 75% threshold for the maximum bytes works well when max_bytes
+ * is a power of two, and a 60% threshold works for other cases.
+ */
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
+ ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from the radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming error where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected Tids. It's similar to tidstore_destroy but we don't free
+ * the entire TidStore; we recreate only the radix tree storage.
+ */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 *values;
+ uint64 key;
+ uint64 prev_key;
+ uint64 off_bitmap = 0;
+ int idx;
+ const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ values = palloc(sizeof(uint64) * nkeys);
+ key = prev_key = key_base;
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 off_bit;
+
+ /* encode the tid to a key and partial offset */
+ key = encode_key_off(ts, blkno, offsets[i], &off_bit);
+
+ /* make sure we scanned the line pointer array in order */
+ Assert(key >= prev_key);
+
+ if (key > prev_key)
+ {
+ idx = prev_key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ /* write out offset bitmap for this key */
+ values[idx] = off_bitmap;
+
+ /* zero out any gaps up to the current key */
+ for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
+ values[empty_idx] = 0;
+
+ /* reset for current key -- the current offset will be handled below */
+ off_bitmap = 0;
+ prev_key = key;
+ }
+
+ off_bitmap |= off_bit;
+ }
+
+ /* save the final index for later */
+ idx = key - key_base;
+ /* write out last offset bitmap */
+ values[idx] = off_bitmap;
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i <= idx; i++)
+ {
+ if (values[i])
+ {
+ key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &values[i]);
+ else
+ local_rt_set(ts->tree.local, key, &values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint64 off_bit;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off_bit);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & off_bit) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value to the radix tree.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+ /* If the TidStore is empty, there is nothing to iterate */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a pointer to TidStoreIterResult that has tids
+ * in one block. We return the block numbers in ascending order and the offset
+ * numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ /* Process the previously collected key-value */
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/*
+ * Finish an iteration over TidStore. This needs to be called after finishing
+ * or when exiting an iteration.
+ */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ while (val)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= pg_rightmost_one_pos64(val);
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+
+ /* unset the rightmost bit */
+ val &= ~pg_rightmost_one64(val);
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
+{
+ uint32 offset = ItemPointerGetOffsetNumber(tid);
+ BlockNumber block = ItemPointerGetBlockNumber(tid);
+
+ return encode_key_off(ts, block, offset, off_bit);
+}
+
+/* encode a block and offset to a key and partial offset */
+static inline uint64
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
+{
+ uint64 key;
+ uint64 tid_i;
+ uint32 off_lower;
+
+ off_lower = offset & TIDSTORE_OFFSET_MASK;
+ Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
+
+ *off_bit = UINT64CONST(1) << off_lower;
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 9659eb85d7..bddc16ada7 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 232cbdac80..c0d5645ad8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,5 +30,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9a1217f833
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,226 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+/* #define TEST_SHARED_TIDSTORE 1 */
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+#else
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+#endif
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+#else
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+#endif
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Mar 10, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I'd suggest sharing your todo list in the meanwhile, it'd be good to
discuss what's worth doing and what is not.
Apart from more rounds of reviews and tests, my todo items that need
discussion and possibly implementation are:
Quick thoughts on these:
* The memory measurement in radix trees and the memory limit in
tidstores. I've implemented it in v30-0007 through 0009 but we need to
review it. This is the highest priority for me.
Agreed.
* Additional size classes. It's important for an alternative of path
compression as well as supporting our decoupling approach. Middle
priority.
I'm going to push back a bit and claim this doesn't bring much gain, while
it does have a complexity cost. The node1 from Andres's prototype is 32
bytes in size, same as our node3, so it's roughly equivalent as a way to
ameliorate the lack of path compression. I say "roughly" because the loop
in node3 is probably noticeably slower. A new size class will by definition
still use that loop.
About a smaller node125-type class: I'm actually not even sure we need to
have any sub-max node bigger than about 64 (node size 768 bytes). I'd just let
65+ go to the max node -- there won't be many of them, at least in
synthetic workloads we've seen so far.
* Node shrinking support. Low priority.
This is an architectural wart that's been neglected since the tid store
doesn't perform deletion. We'll need it sometime. If we're not going to
make this work, why ship a deletion API at all?
I took a look at this a couple weeks ago, and fixing it wouldn't be that
hard. I even had an idea of how to detect when to shrink size class within
a node kind, while keeping the header at 5 bytes. I'd be willing to put
effort into that, but to have a chance of succeeding, I'm unwilling to make
it more difficult by adding more size classes at this point.
--
John Naylor
EDB: http://www.enterprisedb.com
On Sun, Mar 12, 2023 at 12:54 AM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 10, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:I'd suggest sharing your todo list in the meanwhile, it'd be good to discuss what's worth doing and what is not.
Apart from more rounds of reviews and tests, my todo items that need
discussion and possibly implementation are:

Quick thoughts on these:
* The memory measurement in radix trees and the memory limit in
tidstores. I've implemented it in v30-0007 through 0009 but we need to
review it. This is the highest priority for me.

Agreed.
* Additional size classes. It's important for an alternative of path
compression as well as supporting our decoupling approach. Middle
priority.

I'm going to push back a bit and claim this doesn't bring much gain, while it does have a complexity cost. The node1 from Andres's prototype is 32 bytes in size, same as our node3, so it's roughly equivalent as a way to ameliorate the lack of path compression.
But does it mean that our node1 would help reduce the memory further
since our base node type (i.e. RT_NODE) is smaller than the base
node type of Andres's prototype? The result I shared before showed
1.2GB vs. 1.9GB.
I say "roughly" because the loop in node3 is probably noticeably slower. A new size class will by definition still use that loop.
I've evaluated the performance of node1 but the result seems to show
the opposite. I used the test query:
select * from bench_search_random_nodes(100 * 1000 * 1000,
'0xFF000000000000FF');
Which makes the radix tree that has node1 look like:
max_val = 18446744073709551615
num_keys = 65536
height = 7, n1 = 1536, n3 = 0, n15 = 0, n32 = 0, n61 = 0, n256 = 257
All internal nodes except for the root node are node1. The radix tree
that doesn't have node1 is:
max_val = 18446744073709551615
num_keys = 65536
height = 7, n3 = 1536, n15 = 0, n32 = 0, n125 = 0, n256 = 257
Here is the result:
* w/ node1
mem_allocated | load_ms | search_ms
---------------+---------+-----------
573448 | 1848 | 1707
(1 row)
* w/o node1
mem_allocated | load_ms | search_ms
---------------+---------+-----------
598024 | 2014 | 1825
(1 row)
Am I missing something?
About a smaller node125-type class: I'm actually not even sure we need to have any sub-max node bigger than about 64 (node size 768 bytes). I'd just let 65+ go to the max node -- there won't be many of them, at least in synthetic workloads we've seen so far.
Makes sense to me.
* Node shrinking support. Low priority.
This is an architectural wart that's been neglected since the tid store doesn't perform deletion. We'll need it sometime. If we're not going to make this work, why ship a deletion API at all?
I took a look at this a couple weeks ago, and fixing it wouldn't be that hard. I even had an idea of how to detect when to shrink size class within a node kind, while keeping the header at 5 bytes. I'd be willing to put effort into that, but to have a chance of succeeding, I'm unwilling to make it more difficult by adding more size classes at this point.
I think that the deletion (and locking support) doesn't have use cases
in the core (i.e. tidstore) but is implemented so that external
extensions can use it. There might not be such extensions. Given the
lack of use cases in the core (and the remaining time), I think it's
okay even if the implementation of such API is minimal and not
optimized enough. For instance, the implementation of dshash.c is
minimalist, and doesn't have resizing. We can improve them in the
future if extensions or other core features want.
Personally I think we should focus on addressing feedback that we
would get and improving the existing use cases for the rest of the time.
That's why considering min-max size class has a higher priority than
the node shrinking support in my todo list.
FYI, I've run TPC-C workload over the weekend, and didn't get any
failures of the assertion proving tidstore and the current tid lookup
return the same result.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 13, 2023 at 8:41 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Sun, Mar 12, 2023 at 12:54 AM John Naylor
<john.naylor@enterprisedb.com> wrote:On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
* Additional size classes. It's important for an alternative of path
compression as well as supporting our decoupling approach. Middle
priority.

I'm going to push back a bit and claim this doesn't bring much gain,
while it does have a complexity cost. The node1 from Andres's prototype is
32 bytes in size, same as our node3, so it's roughly equivalent as a way to
ameliorate the lack of path compression.
But does it mean that our node1 would help reduce the memory further
since our base node type (i.e. RT_NODE) is smaller than the base
node type of Andres's prototype? The result I shared before showed
1.2GB vs. 1.9GB.
The benefit is found in a synthetic benchmark with random integers. I
highly doubt that anyone would be willing to force us to keep
binary-searching the 1GB array for one more cycle on account of not adding
a size class here. I'll repeat myself and say that there are also
maintenance costs.
In contrast, I'm fairly certain that our attempts thus far at memory
accounting/limiting are not quite up to par, and lacking enough to
jeopardize the feature. We're already discussing that, so I'll say no more.
I say "roughly" because the loop in node3 is probably noticeably
slower. A new size class will by definition still use that loop.
I've evaluated the performance of node1 but the result seems to show
the opposite.
As an aside, I meant the loop in our node3 might make your node1 slower
than the prototype's node1, which was coded for 1 member only.
* Node shrinking support. Low priority.
This is an architectural wart that's been neglected since the tid store
doesn't perform deletion. We'll need it sometime. If we're not going to
make this work, why ship a deletion API at all?
I took a look at this a couple weeks ago, and fixing it wouldn't be
that hard. I even had an idea of how to detect when to shrink size class
within a node kind, while keeping the header at 5 bytes. I'd be willing to
put effort into that, but to have a chance of succeeding, I'm unwilling to
make it more difficult by adding more size classes at this point.
I think that the deletion (and locking support) doesn't have use cases
in the core (i.e. tidstore) but is implemented so that external
extensions can use it.
I think these cases are a bit different: Doing anything with a data
structure stored in shared memory without a synchronization scheme is
completely unthinkable and insane. I'm not yet sure if
deleting-without-shrinking is a showstopper, or if it's preferable in v16
to no deletion at all.
Anything we don't implement now is a limit on future use cases, and thus a
cause for objection. On the other hand, anything we implement also
represents more stuff that will have to be rewritten for high-concurrency.
FYI, I've run TPC-C workload over the weekend, and didn't get any
failures of the assertion proving tidstore and the current tid lookup
return the same result.
Great!
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Mar 13, 2023 at 10:28 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Mar 13, 2023 at 8:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Sun, Mar 12, 2023 at 12:54 AM John Naylor
<john.naylor@enterprisedb.com> wrote:On Fri, Mar 10, 2023 at 9:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
* Additional size classes. It's important for an alternative of path
compression as well as supporting our decoupling approach. Middle
priority.

I'm going to push back a bit and claim this doesn't bring much gain, while it does have a complexity cost. The node1 from Andres's prototype is 32 bytes in size, same as our node3, so it's roughly equivalent as a way to ameliorate the lack of path compression.
But does it mean that our node1 would help reduce the memory further
since since our base node type (i.e. RT_NODE) is smaller than the base
node type of Andres's prototype? The result I shared before showed
1.2GB vs. 1.9GB.

The benefit is found in a synthetic benchmark with random integers. I highly doubt that anyone would be willing to force us to keep binary-searching the 1GB array for one more cycle on account of not adding a size class here. I'll repeat myself and say that there are also maintenance costs.
In contrast, I'm fairly certain that our attempts thus far at memory accounting/limiting are not quite up to par, and lacking enough to jeopardize the feature. We're already discussing that, so I'll say no more.
I agree that memory accounting/limiting stuff is the highest priority.
So what kinds of size classes do you think we need? node3, 15, 32, 61
and 256?
I say "roughly" because the loop in node3 is probably noticeably slower. A new size class will by definition still use that loop.
I've evaluated the performance of node1 but the result seems to show
the opposite.

As an aside, I meant the loop in our node3 might make your node1 slower than the prototype's node1, which was coded for 1 member only.
Agreed.
* Node shrinking support. Low priority.
This is an architectural wart that's been neglected since the tid store doesn't perform deletion. We'll need it sometime. If we're not going to make this work, why ship a deletion API at all?
I took a look at this a couple weeks ago, and fixing it wouldn't be that hard. I even had an idea of how to detect when to shrink size class within a node kind, while keeping the header at 5 bytes. I'd be willing to put effort into that, but to have a chance of succeeding, I'm unwilling to make it more difficult by adding more size classes at this point.
I think that the deletion (and locking support) doesn't have use cases
in the core (i.e. tidstore) but is implemented so that external
extensions can use it.

I think these cases are a bit different: Doing anything with a data structure stored in shared memory without a synchronization scheme is completely unthinkable and insane.
Right.
I'm not yet sure if deleting-without-shrinking is a showstopper, or if it's preferable in v16 to no deletion at all.
Anything we don't implement now is a limit on future use cases, and thus a cause for objection. On the other hand, anything we implement also represents more stuff that will have to be rewritten for high-concurrency.
Okay. Given that adding shrinking support also requires maintenance
costs (and probably new test cases?) and there are no use cases in the
core, I'm not sure it's worth supporting it at this stage. So I'd prefer
either shipping the deletion API as it is or removing it altogether. I
think that's a discussion point on which we'd like to hear feedback from
other hackers.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
I wrote:
Since the block-level measurement is likely overestimating quite a
bit, I propose to simply reverse the order of the actions here, effectively
reporting progress for the *last page* and not the current one: First
update progress with the current memory usage, then add tids for this page.
If this allocated a new block, only a small bit of that will be written to.
If this block pushes it over the limit, we will detect that up at the top
of the loop. It's kind of like our earlier attempts at a "fudge factor",
but simpler and less brittle. And, as far as OS pages we have actually
written to, I think it'll effectively respect the memory limit, at least in
the local mem case. And the numbers will make sense.
Thoughts?
It looks to work but it still doesn't work in a case where a shared
tidstore is created with a 64kB memory limit, right?
TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
from the beginning.

I have two ideas:
1. Make it optional to track chunk memory space by a template parameter.
It might be tiny compared to everything else that vacuum does. That would
allow other users to avoid that overhead.
2. When context block usage exceeds the limit (rare), make the additional
effort to get the precise usage -- I'm not sure such a top-down facility
exists, and I'm not feeling well enough today to study this further.
Since then, Masahiko incorporated #1 into v31, and that's what I'm looking
at now. Unfortunately, if I had spent five minutes reminding myself what
the original objections were to this approach, I could have saved us some
effort. Back in July (!), Andres raised two points: GetMemoryChunkSpace()
is slow [1]/messages/by-id/20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de, and fragmentation [2]/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de (leading to underestimation).
In v31, in the local case at least, the underestimation is actually worse
than tracking chunk space, since it ignores chunk header and alignment.
I'm not sure about the DSA case. This doesn't seem great.
It shouldn't be a surprise why a simple increment of raw allocation size is
comparable in speed -- GetMemoryChunkSpace() calls the right function
through a pointer, which is slower. If we were willing to underestimate for
the sake of speed, that takes away the reason for making memory tracking
optional.
Further, if the option is not specified, in v31 there is no way to get the
memory use at all, which seems odd. Surely the caller should be able to ask
the context/area, if it wants to.
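As a rough sketch of what asking the context/area could look like, assuming the local TidStore remembers the MemoryContext it allocates from (the ts->context field and the function name below are illustrative, not part of the posted patch):

static size_t
tidstore_memory_usage_from_storage(TidStore *ts)
{
	if (TidStoreIsShared(ts))
		return dsa_get_total_size(ts->area);	/* whole DSA area, block level */

	/* block-level total, including child contexts */
	return MemoryContextMemAllocated(ts->context, true);
}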
I still like my idea at the top of the page -- at least for vacuum and
m_w_m. It's still not completely clear if it's right but I've got nothing
better. It also ignores the work_mem issue, but I've given up anticipating
all future cases at the moment.
I'll put this item and a couple other things together in a separate email
tomorrow.
[1]: /messages/by-id/20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de
/messages/by-id/20220704211822.kfxtzpcdmslzm2dy@awork3.anarazel.de
[2]: /messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
/messages/by-id/20220704220038.at2ane5xkymzzssb@awork3.anarazel.de
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Mar 14, 2023 at 8:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
I wrote:
Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
Thoughts?
It looks to work but it still doesn't work in a case where a shared
tidstore is created with a 64kB memory limit, right?
TidStoreMemoryUsage() returns 1MB and TidStoreIsFull() returns true
from the beginning.

I have two ideas:
1. Make it optional to track chunk memory space by a template parameter. It might be tiny compared to everything else that vacuum does. That would allow other users to avoid that overhead.
2. When context block usage exceeds the limit (rare), make the additional effort to get the precise usage -- I'm not sure such a top-down facility exists, and I'm not feeling well enough today to study this further.

Since then, Masahiko incorporated #1 into v31, and that's what I'm looking at now. Unfortunately, if I had spent five minutes reminding myself what the original objections were to this approach, I could have saved us some effort. Back in July (!), Andres raised two points: GetMemoryChunkSpace() is slow [1], and fragmentation [2] (leading to underestimation).
In v31, in the local case at least, the underestimation is actually worse than tracking chunk space, since it ignores chunk header and alignment. I'm not sure about the DSA case. This doesn't seem great.
Right.
It shouldn't be a surprise why a simple increment of raw allocation size is comparable in speed -- GetMemoryChunkSpace() calls the right function through a pointer, which is slower. If we were willing to underestimate for the sake of speed, that takes away the reason for making memory tracking optional.
Further, if the option is not specified, in v31 there is no way to get the memory use at all, which seems odd. Surely the caller should be able to ask the context/area, if it wants to.
There are precedents that don't provide a way to return memory usage,
such as simplehash.h and dshash.c.
I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
What do you mean by "the precise usage" in your idea? Quoting from
the email you referred to, Andres said:
---
One thing I was wondering about is trying to choose node types in
roughly-power-of-two struct sizes. It's pretty easy to end up with significant
fragmentation in the slabs right now when inserting as you go, because some of
the smaller node types will be freed but not enough to actually free blocks of
memory. If we instead have ~power-of-two sizes we could just use a single slab
of the max size, and carve out the smaller node types out of that largest
allocation.
Btw, that fragmentation is another reason why I think it's better to track
memory usage via memory contexts, rather than doing so based on
GetMemoryChunkSpace().
---
IIUC he suggested measuring memory usage at the block level in order to
count blocks that are not actually freed even though some of their chunks
are freed. That's why we used MemoryContextMemAllocated(). On the other
hand, recently you pointed out[1]/messages/by-id/CAFBsxsEnzivaJ13iCGdDoUMsXJVGOaahuBe_y=q6ow=LTzyDvA@mail.gmail.com:
---
I think we're trying to solve the wrong problem here. I need to study
this more, but it seems that code that needs to stay within a memory
limit only needs to track what's been allocated in chunks within a
block, since writing there is what invokes a page fault.
---
IIUC you suggested measuring memory usage by tracking how much memory is
allocated in chunks within a block. If your idea at the top of the
page follows this method, it still doesn't deal with the point Andres
mentioned.
I'll put this item and a couple other things together in a separate email tomorrow.
Thanks!
Regards,
[1]: /messages/by-id/CAFBsxsEnzivaJ13iCGdDoUMsXJVGOaahuBe_y=q6ow=LTzyDvA@mail.gmail.com
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Tue, Mar 14, 2023 at 8:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:I wrote:
Since the block-level measurement is likely overestimating quite
a bit, I propose to simply reverse the order of the actions here,
effectively reporting progress for the *last page* and not the current one:
First update progress with the current memory usage, then add tids for this
page. If this allocated a new block, only a small bit of that will be
written to. If this block pushes it over the limit, we will detect that up
at the top of the loop. It's kind of like our earlier attempts at a "fudge
factor", but simpler and less brittle. And, as far as OS pages we have
actually written to, I think it'll effectively respect the memory limit, at
least in the local mem case. And the numbers will make sense.
I still like my idea at the top of the page -- at least for vacuum and
m_w_m. It's still not completely clear if it's right but I've got nothing
better. It also ignores the work_mem issue, but I've given up anticipating
all future cases at the moment.
IIUC you suggested measuring memory usage by tracking how much memory
chunks are allocated within a block. If your idea at the top of the
page follows this method, it still doesn't deal with the point Andres
mentioned.
Right, but that idea was orthogonal to how we measure memory use, and in
fact mentions blocks specifically. The re-ordering was just to make sure
that progress reporting didn't show current-use > max-use.
However, the big question remains DSA, since a new segment can be as large
as the entire previous set of allocations. It seems it just wasn't designed
for things where memory growth is unpredictable.
I'm starting to wonder if we need to give DSA a bit more info at the start.
Imagine a "soft" limit given to the DSA area when it is initialized. If the
total segment usage exceeds this, it stops doubling and instead new
segments get smaller. Modifying an example we used for the fudge-factor
idea some time ago:
m_w_m = 1GB, so calculate the soft limit to be 512MB and pass it to the DSA
area.
2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> hit soft limit, so
"stairstep down" the new segment sizes:
766 + 2*(128) + 64 = 1086MB -> stop
That's just an undeveloped idea, however, so likely v17 development, even
assuming it's not a bad idea (could be).
And sadly, unless we find some other, simpler answer soon for tracking and
limiting shared memory, the tid store is looking like v17 material.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Mar 17, 2023 at 4:03 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Mar 14, 2023 at 8:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:I wrote:
Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
IIUC you suggested measuring memory usage by tracking how much memory
chunks are allocated within a block. If your idea at the top of the
page follows this method, it still doesn't deal with the point Andres
mentioned.

Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.
Right. I still like your re-ordering idea. It's true that most of the
area of the last allocated block before heap scanning stops is not
actually used yet. I'm guessing we can just check if the context
memory has gone over the limit. But I'm concerned it might not work
well in systems where overcommit memory is disabled.
However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.
I'm starting to wonder if we need to give DSA a bit more info at the start. Imagine a "soft" limit given to the DSA area when it is initialized. If the total segment usage exceeds this, it stops doubling and instead new segments get smaller. Modifying an example we used for the fudge-factor idea some time ago:
m_w_m = 1GB, so calculate the soft limit to be 512MB and pass it to the DSA area.
2*(1+2+4+8+16+32+64+128) + 256 = 766MB (74.8% of 1GB) -> hit soft limit, so "stairstep down" the new segment sizes:
766 + 2*(128) + 64 = 1086MB -> stop
That's just an undeveloped idea, however, so likely v17 development, even assuming it's not a bad idea (could be).
This is an interesting idea. But I'm concerned we don't have enough
time to get confident with adding this new concept to DSA.
And sadly, unless we find some other, simpler answer soon for tracking and limiting shared memory, the tid store is looking like v17 material.
Another problem we need to deal with is the supported minimum memory
in shared tidstore cases. Since the initial DSA segment size is 1MB,
memory usage of a shared tidstore will start from 1MB+. This is higher
than the minimum values of both work_mem and maintenance_work_mem,
64kB and 1MB respectively. Increasing the minimum m_w_m to 2MB seems
to be acceptable in the community but not for work_mem. One idea is to
reject a memory limit of less than 2MB, so it won't work with small m_w_m
settings. While that might be an acceptable restriction at this stage
(where there is no use case of using tidstore with work_mem in the
core), it would be a blocker for future adoptions such as unifying with
tidbitmap.c. Another idea is that the process can specify the initial
segment size at dsa_create() so that DSA can start with a smaller
segment, say 32kB. That way, a tidstore with a 32kB limit gets full once
it allocates the next DSA segment, 32kB. But a downside of this idea is
that it increases the number of segments behind DSA. Assuming it's a
relatively rare case where we use such a low
work_mem, it might be acceptable. FYI, the total number of DSM
segments available on the system is calculated by:
#define PG_DYNSHMEM_FIXED_SLOTS 64
#define PG_DYNSHMEM_SLOTS_PER_BACKEND 5
maxitems = PG_DYNSHMEM_FIXED_SLOTS
+ PG_DYNSHMEM_SLOTS_PER_BACKEND * MaxBackends;
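For illustration, with MaxBackends = 100 that works out to 64 + 5 * 100 = 564 DSM slots system-wide.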
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Mar 17, 2023 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 17, 2023 at 4:03 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Mar 14, 2023 at 8:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:I wrote:
Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
IIUC you suggested measuring memory usage by tracking how much memory
chunks are allocated within a block. If your idea at the top of the
page follows this method, it still doesn't deal with the point Andres
mentioned.

Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.
Right. I still like your re-ordering idea. It's true that most of the
area of the last allocated block before heap scanning stops is not
actually used yet. I'm guessing we can just check if the context
memory has gone over the limit. But I'm concerned it might not work
well in systems where overcommit memory is disabled.

However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.
aset.c also has a similar characteristic; it allocates an 8K block upon
the first allocation in a context, and doubles that size for each
successive block request. But we can specify the initial block size
and max blocksize. This made me think of another idea: specify both
to DSA, with both values calculated based on m_w_m. For example, we
can create a DSA in parallel_vacuum_init() as follows:
initial block size = min(m_w_m / 4, 1MB)
max block size = max(m_w_m / 8, 8MB)
In most cases, we can start with a 1MB initial segment, the same as
before. For small memory cases, say 1MB, we start with a 256KB initial
segment and heap scanning stops after DSA allocated 1.5MB (= 256kB +
256kB + 512kB + 512kB). For larger memory, we can have heap scan stop
after DSA allocates 1.25 times more memory than m_w_m. For example, if
m_w_m = 1GB, both the initial and maximum segment sizes are 1MB and
128MB respectively, and then DSA allocates the segments as follows
until heap scanning stops:
2 * (1 + 2 + 4 + 8 + 16 + 32 + 64 + 128) + (128 * 5) = 1150MB
dsa_create() will be extended to take the initial and maximum block
sizes, like AllocSetContextCreate().
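To illustrate, the calculation in parallel_vacuum_init() could look something like this (dsa_create_ext() stands in for the extended creation API described above, which doesn't exist yet; the variable names and the tranche choice are just for the sketch):

	size_t		init_segsize;
	size_t		max_segsize;
	dsa_area   *dead_items_area;

	/* maintenance_work_mem is in kilobytes */
	init_segsize = Min((size_t) maintenance_work_mem * 1024 / 4,
					   1024 * 1024);			/* at most 1MB */
	max_segsize = Max((size_t) maintenance_work_mem * 1024 / 8,
					  8 * 1024 * 1024);			/* at least 8MB */

	dead_items_area = dsa_create_ext(LWTRANCHE_SHARED_TIDSTORE,
									 init_segsize, max_segsize);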
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Mon, Mar 20, 2023 at 12:25 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Mar 17, 2023 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Fri, Mar 17, 2023 at 4:03 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Tue, Mar 14, 2023 at 8:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:I wrote:
Since the block-level measurement is likely overestimating
quite a bit, I propose to simply reverse the order of the actions here,
effectively reporting progress for the *last page* and not the current one:
First update progress with the current memory usage, then add tids for this
page. If this allocated a new block, only a small bit of that will be
written to. If this block pushes it over the limit, we will detect that up
at the top of the loop. It's kind of like our earlier attempts at a "fudge
factor", but simpler and less brittle. And, as far as OS pages we have
actually written to, I think it'll effectively respect the memory limit, at
least in the local mem case. And the numbers will make sense.
I still like my idea at the top of the page -- at least for
vacuum and m_w_m. It's still not completely clear if it's right but I've
got nothing better. It also ignores the work_mem issue, but I've given up
anticipating all future cases at the moment.
IIUC you suggested measuring memory usage by tracking how much
memory
chunks are allocated within a block. If your idea at the top of the
page follows this method, it still doesn't deal with the point
Andres
mentioned.
Right, but that idea was orthogonal to how we measure memory use, and
in fact mentions blocks specifically. The re-ordering was just to make sure
that progress reporting didn't show current-use > max-use.
Right. I still like your re-ordering idea. It's true that the most
area of the last allocated block before heap scanning stops is not
actually used yet. I'm guessing we can just check if the context
memory has gone over the limit. But I'm concerned it might not work
well in systems where overcommit memory is disabled.However, the big question remains DSA, since a new segment can be as
large as the entire previous set of allocations. It seems it just wasn't
designed for things where memory growth is unpredictable.
aset.c also has a similar characteristic; allocates an 8K block upon
the first allocation in a context, and doubles that size for each
successive block request. But we can specify the initial block size
and max blocksize. This made me think of another idea to specify both
to DSA and both values are calculated based on m_w_m. For example, we
That's an interesting idea, and the analogous behavior to aset could be a
good thing for readability and maintainability. Worth seeing if it's
workable.
--
John Naylor
EDB: http://www.enterprisedb.com
On Mon, Mar 20, 2023 at 9:34 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Mar 20, 2023 at 12:25 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 17, 2023 at 4:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 17, 2023 at 4:03 PM John Naylor
<john.naylor@enterprisedb.com> wrote:On Wed, Mar 15, 2023 at 9:32 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Tue, Mar 14, 2023 at 8:27 PM John Naylor
<john.naylor@enterprisedb.com> wrote:I wrote:
Since the block-level measurement is likely overestimating quite a bit, I propose to simply reverse the order of the actions here, effectively reporting progress for the *last page* and not the current one: First update progress with the current memory usage, then add tids for this page. If this allocated a new block, only a small bit of that will be written to. If this block pushes it over the limit, we will detect that up at the top of the loop. It's kind of like our earlier attempts at a "fudge factor", but simpler and less brittle. And, as far as OS pages we have actually written to, I think it'll effectively respect the memory limit, at least in the local mem case. And the numbers will make sense.
I still like my idea at the top of the page -- at least for vacuum and m_w_m. It's still not completely clear if it's right but I've got nothing better. It also ignores the work_mem issue, but I've given up anticipating all future cases at the moment.
IIUC you suggested measuring memory usage by tracking how much memory
chunks are allocated within a block. If your idea at the top of the
page follows this method, it still doesn't deal with the point Andres
mentioned.

Right, but that idea was orthogonal to how we measure memory use, and in fact mentions blocks specifically. The re-ordering was just to make sure that progress reporting didn't show current-use > max-use.
Right. I still like your re-ordering idea. It's true that the most
area of the last allocated block before heap scanning stops is not
actually used yet. I'm guessing we can just check if the context
memory has gone over the limit. But I'm concerned it might not work
well in systems where overcommit memory is disabled.

However, the big question remains DSA, since a new segment can be as large as the entire previous set of allocations. It seems it just wasn't designed for things where memory growth is unpredictable.
aset.c also has a similar characteristic; it allocates an 8K block upon
the first allocation in a context, and doubles that size for each
successive block request. But we can specify the initial block size
and max block size. This made me think of another idea: specify both
to DSA, with both values calculated based on m_w_m. For example, we
That's an interesting idea, and the analogous behavior to aset could be a good thing for readability and maintainability. Worth seeing if it's workable.
I've attached a quick hack patch. It can be applied on top of v32
patches. The changes to dsa.c are straightforward since it makes the
initial and max block sizes configurable. The patch includes a test
function, test_memory_usage(), to simulate how DSA segments grow behind
the shared radix tree. If we set the first argument to true, it
calculates both the initial and maximum block sizes based on work_mem (I
used work_mem here just because its value range is larger than m_w_m):
postgres(1:833654)=# select test_memory_usage(true);
NOTICE: memory limit 134217728
NOTICE: init 1048576 max 16777216
NOTICE: initial: 1048576
NOTICE: rt_create: 1048576
NOTICE: allocate new DSM [1] 1048576
NOTICE: allocate new DSM [2] 2097152
NOTICE: allocate new DSM [3] 2097152
NOTICE: allocate new DSM [4] 4194304
NOTICE: allocate new DSM [5] 4194304
NOTICE: allocate new DSM [6] 8388608
NOTICE: allocate new DSM [7] 8388608
NOTICE: allocate new DSM [8] 16777216
NOTICE: allocate new DSM [9] 16777216
NOTICE: allocate new DSM [10] 16777216
NOTICE: allocate new DSM [11] 16777216
NOTICE: allocate new DSM [12] 16777216
NOTICE: allocate new DSM [13] 16777216
NOTICE: allocate new DSM [14] 16777216
NOTICE: reached: 148897792 (+14680064)
NOTICE: 12718205 keys inserted: 148897792
test_memory_usage
-------------------
(1 row)
Time: 7195.664 ms (00:07.196)
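For reference, with work_mem = 128MB the init/max values in the NOTICE
output above follow directly from the formulas in the attached hack patch:

limit = 131072kB * 1024              = 134217728 (128MB)
init  = Min(134217728 / 4, 1MB)      = 1048576   (1MB)
max   = Max(134217728 / 8, 8MB)      = 16777216  (16MB)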
By setting the first argument to false, we can specify both manually in
the second and third arguments:
postgres(1:833654)=# select test_memory_usage(false, 1024 * 1024, 1024
* 1024 * 1024 * 10::bigint);
NOTICE: memory limit 134217728
NOTICE: init 1048576 max 10737418240
NOTICE: initial: 1048576
NOTICE: rt_create: 1048576
NOTICE: allocate new DSM [1] 1048576
NOTICE: allocate new DSM [2] 2097152
NOTICE: allocate new DSM [3] 2097152
NOTICE: allocate new DSM [4] 4194304
NOTICE: allocate new DSM [5] 4194304
NOTICE: allocate new DSM [6] 8388608
NOTICE: allocate new DSM [7] 8388608
NOTICE: allocate new DSM [8] 16777216
NOTICE: allocate new DSM [9] 16777216
NOTICE: allocate new DSM [10] 33554432
NOTICE: allocate new DSM [11] 33554432
NOTICE: allocate new DSM [12] 67108864
NOTICE: reached: 199229440 (+65011712)
NOTICE: 12718205 keys inserted: 199229440
test_memory_usage
-------------------
(1 row)
Time: 7187.571 ms (00:07.188)
It seems to work fine. The difference between the above two cases is
the maximum block size (16MB vs. 10GB). We allocated two more DSA
segments in the first case, but there was no big difference in
performance in my test environment.
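As a cross-check, here is a small standalone simulation (a sketch under the
assumption that segment sizes follow the rule in make_new_segment(): size =
init << (index / DSA_NUM_SEGMENTS_AT_EACH_SIZE), capped at the maximum
segment size). It reproduces both "reached" values above:

/* simulate_dsa_growth.c -- rough simulation of DSA segment size growth */
#include <stdio.h>
#include <stdint.h>

#define NUM_SEGMENTS_AT_EACH_SIZE 2     /* mirrors DSA_NUM_SEGMENTS_AT_EACH_SIZE */

static void
simulate(uint64_t limit, uint64_t init, uint64_t max)
{
    uint64_t    total = init;   /* segment 0 backs the control object */
    int         index;

    /* keep adding segments until the total exceeds the memory limit */
    for (index = 1; total <= limit; index++)
    {
        uint64_t    size = init << (index / NUM_SEGMENTS_AT_EACH_SIZE);

        if (size > max)
            size = max;
        total += size;
    }

    printf("init %llu max %llu: reached %llu (+%llu)\n",
           (unsigned long long) init, (unsigned long long) max,
           (unsigned long long) total, (unsigned long long) (total - limit));
}

int
main(void)
{
    uint64_t    mb = 1024 * 1024;

    simulate(128 * mb, mb, 16 * mb);    /* reached 148897792 (+14680064) */
    simulate(128 * mb, mb, 10240 * mb); /* reached 199229440 (+65011712) */
    return 0;
}

The bounded case can overshoot the limit by at most one maximum-sized
segment (16MB here), whereas with the effectively unbounded cap the final
doubling alone adds 64MB.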
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
dsa_init_max_block_size.patch.txt
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index ad66265e23..12121dd1d4 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -86,3 +86,12 @@ OUT iter_ms int8
returns record
as 'MODULE_PATHNAME'
LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function test_memory_usage(
+use_m_w_m bool,
+init_blksize int8 default (1024 * 1024),
+max_blksize int8 default (1024 * 1024 * 1024 * 10::bigint)
+)
+returns void
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 41d83aee11..0580faed6c 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -40,6 +40,18 @@ PG_MODULE_MAGIC;
// #define RT_SHMEM
#include "lib/radixtree.h"
+//#define RT_DEBUG
+#define RT_PREFIX shared_rt
+#define RT_SCOPE
+#define RT_DECLARE
+#define RT_DEFINE
+//#define RT_USE_DELETE
+//#define RT_MEASURE_MEMORY_USAGE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+#define RT_SHMEM
+#include "lib/radixtree.h"
+
/*
* Return the number of keys in the radix tree.
*/
@@ -57,6 +69,7 @@ PG_FUNCTION_INFO_V1(bench_fixed_height_search);
PG_FUNCTION_INFO_V1(bench_search_random_nodes);
PG_FUNCTION_INFO_V1(bench_node128_load);
PG_FUNCTION_INFO_V1(bench_tidstore_load);
+PG_FUNCTION_INFO_V1(test_memory_usage);
static uint64
tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -745,4 +758,56 @@ stub_iter()
iter = rt_begin_iterate(rt);
rt_iterate_next(iter, &key, &value);
rt_end_iterate(iter);
-}
\ No newline at end of file
+}
+
+Datum
+test_memory_usage(PG_FUNCTION_ARGS)
+{
+ bool use_work_mem = PG_GETARG_BOOL(0);
+ int64 init = PG_GETARG_INT64(1);
+ int64 max = PG_GETARG_INT64(2);
+ int tranche_id = LWLockNewTrancheId();
+ const int limit = work_mem * 1024;
+ dsa_area *dsa;
+ shared_rt_radix_tree *rt;
+ uint64 i;
+
+ LWLockRegisterTranche(tranche_id, "test");
+
+ if (use_work_mem)
+ {
+ init = Min(((int64)work_mem * 1024) / 4, 1024 * 1024);
+ max = Max(((int64)work_mem * 1024) / 8, (int64) 8 * 1024 * 1024);
+ }
+
+ elog(NOTICE, "memory limit %ld", (int64) work_mem * 1024);
+ elog(NOTICE, "init %ld max %ld", init, max);
+ dsa = dsa_create_ext(tranche_id, init, max);
+
+ elog(NOTICE, "initial: %zu", dsa_get_total_segment_size(dsa));
+
+ rt = shared_rt_create(CurrentMemoryContext, dsa, tranche_id);
+ elog(NOTICE, "rt_create: %zu", dsa_get_total_segment_size(dsa));
+
+ for (i = 0; i < (1000 * 1000 * 1000); i++)
+ {
+ volatile bool ret;
+ size_t size;
+
+ ret = shared_rt_set(rt, i, &i);
+
+ size = dsa_get_total_segment_size(dsa);
+
+ if (limit < size)
+ {
+ elog(NOTICE, "reached: %zu (+%zu)", size, size - limit);
+ break;
+ }
+ }
+
+ elog(NOTICE, "%ld keys inserted: %zu", i, dsa_get_total_segment_size(dsa));
+
+ shared_rt_free(rt);
+ dsa_detach(dsa);
+ PG_RETURN_VOID();
+}
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..a81008d84e 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -60,14 +60,6 @@
#include "utils/freepage.h"
#include "utils/memutils.h"
-/*
- * The size of the initial DSM segment that backs a dsa_area created by
- * dsa_create. After creating some number of segments of this size we'll
- * double this size, and so on. Larger segments may be created if necessary
- * to satisfy large requests.
- */
-#define DSA_INITIAL_SEGMENT_SIZE ((size_t) (1 * 1024 * 1024))
-
/*
* How many segments to create before we double the segment size. If this is
* low, then there is likely to be a lot of wasted space in the largest
@@ -77,17 +69,6 @@
*/
#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 2
-/*
- * The number of bits used to represent the offset part of a dsa_pointer.
- * This controls the maximum size of a segment, the maximum possible
- * allocation size and also the maximum number of segments per area.
- */
-#if SIZEOF_DSA_POINTER == 4
-#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
-#else
-#define DSA_OFFSET_WIDTH 40 /* 1024 segments of size up to 1TB */
-#endif
-
/*
* The maximum number of DSM segments that an area can own, determined by
* the number of bits remaining (but capped at 1024).
@@ -98,9 +79,6 @@
/* The bitmask for extracting the offset from a dsa_pointer. */
#define DSA_OFFSET_BITMASK (((dsa_pointer) 1 << DSA_OFFSET_WIDTH) - 1)
-/* The maximum size of a DSM segment. */
-#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
-
/* Number of pages (see FPM_PAGE_SIZE) per regular superblock. */
#define DSA_PAGES_PER_SUPERBLOCK 16
@@ -319,6 +297,10 @@ typedef struct
dsa_segment_index segment_bins[DSA_NUM_SEGMENT_BINS];
/* The object pools for each size class. */
dsa_area_pool pools[DSA_NUM_SIZE_CLASSES];
+ /* initial allocation segment size */
+ size_t init_segment_size;
+ /* maximum allocation segment size */
+ size_t max_segment_size;
/* The total size of all active segments. */
size_t total_segment_size;
/* The maximum total size of backing storage we are allowed. */
@@ -413,7 +395,9 @@ static dsa_segment_map *make_new_segment(dsa_area *area, size_t requested_pages)
static dsa_area *create_internal(void *place, size_t size,
int tranche_id,
dsm_handle control_handle,
- dsm_segment *control_segment);
+ dsm_segment *control_segment,
+ size_t init_segment_size,
+ size_t max_segment_size);
static dsa_area *attach_internal(void *place, dsm_segment *segment,
dsa_handle handle);
static void check_for_freed_segments(dsa_area *area);
@@ -429,7 +413,7 @@ static void check_for_freed_segments_locked(dsa_area *area);
* we require the caller to provide one.
*/
dsa_area *
-dsa_create(int tranche_id)
+dsa_create_ext(int tranche_id, size_t init_segment_size, size_t max_segment_size)
{
dsm_segment *segment;
dsa_area *area;
@@ -438,7 +422,7 @@ dsa_create(int tranche_id)
* Create the DSM segment that will hold the shared control object and the
* first segment of usable space.
*/
- segment = dsm_create(DSA_INITIAL_SEGMENT_SIZE, 0);
+ segment = dsm_create(init_segment_size, 0);
/*
* All segments backing this area are pinned, so that DSA can explicitly
@@ -450,9 +434,10 @@ dsa_create(int tranche_id)
/* Create a new DSA area with the control object in this segment. */
area = create_internal(dsm_segment_address(segment),
- DSA_INITIAL_SEGMENT_SIZE,
+ init_segment_size,
tranche_id,
- dsm_segment_handle(segment), segment);
+ dsm_segment_handle(segment), segment,
+ init_segment_size, max_segment_size);
/* Clean up when the control segment detaches. */
on_dsm_detach(segment, &dsa_on_dsm_detach_release_in_place,
@@ -478,13 +463,15 @@ dsa_create(int tranche_id)
* See dsa_create() for a note about the tranche arguments.
*/
dsa_area *
-dsa_create_in_place(void *place, size_t size,
- int tranche_id, dsm_segment *segment)
+dsa_create_in_place_ext(void *place, size_t size,
+ int tranche_id, dsm_segment *segment,
+ size_t init_segment_size, size_t max_segment_size)
{
dsa_area *area;
area = create_internal(place, size, tranche_id,
- DSM_HANDLE_INVALID, NULL);
+ DSM_HANDLE_INVALID, NULL,
+ init_segment_size, max_segment_size);
/*
* Clean up when the control segment detaches, if a containing DSM segment
@@ -1024,6 +1011,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_segment_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
@@ -1203,7 +1202,8 @@ static dsa_area *
create_internal(void *place, size_t size,
int tranche_id,
dsm_handle control_handle,
- dsm_segment *control_segment)
+ dsm_segment *control_segment,
+ size_t init_segment_size, size_t max_segment_size)
{
dsa_area_control *control;
dsa_area *area;
@@ -1213,6 +1213,9 @@ create_internal(void *place, size_t size,
size_t metadata_bytes;
int i;
+ Assert(max_segment_size >= init_segment_size);
+ Assert(max_segment_size <= DSA_MAX_SEGMENT_SIZE);
+
/* Sanity check on the space we have to work in. */
if (size < dsa_minimum_size())
elog(ERROR, "dsa_area space must be at least %zu, but %zu provided",
@@ -1242,8 +1245,10 @@ create_internal(void *place, size_t size,
control->segment_header.prev = DSA_SEGMENT_INDEX_NONE;
control->segment_header.usable_pages = usable_pages;
control->segment_header.freed = false;
- control->segment_header.size = DSA_INITIAL_SEGMENT_SIZE;
+ control->segment_header.size = size;
control->handle = control_handle;
+ control->init_segment_size = init_segment_size;
+ control->max_segment_size = max_segment_size;
control->max_total_segment_size = (size_t) -1;
control->total_segment_size = size;
control->segment_handles[0] = control_handle;
@@ -2112,12 +2117,13 @@ make_new_segment(dsa_area *area, size_t requested_pages)
* move to huge pages in the future. Then we work back to the number of
* pages we can fit.
*/
- total_size = DSA_INITIAL_SEGMENT_SIZE *
+ total_size = area->control->init_segment_size *
((size_t) 1 << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE));
- total_size = Min(total_size, DSA_MAX_SEGMENT_SIZE);
+ total_size = Min(total_size, area->control->max_segment_size);
total_size = Min(total_size,
area->control->max_total_segment_size -
area->control->total_segment_size);
+ elog(NOTICE, "allocate new DSM [%zu] %zu", new_index, total_size);
total_pages = total_size / FPM_PAGE_SIZE;
metadata_bytes =
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..0baa32b9de 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -77,6 +77,28 @@ typedef pg_atomic_uint64 dsa_pointer_atomic;
/* A sentinel value for dsa_pointer used to indicate failure to allocate. */
#define InvalidDsaPointer ((dsa_pointer) 0)
+/*
+ * The size of the initial DSM segment that backs a dsa_area created by
+ * dsa_create. After creating some number of segments of this size we'll
+ * double this size, and so on. Larger segments may be created if necessary
+ * to satisfy large requests.
+ */
+#define DSA_INITIAL_SEGMENT_SIZE ((size_t) (1 * 1024 * 1024))
+
+/*
+ * The number of bits used to represent the offset part of a dsa_pointer.
+ * This controls the maximum size of a segment, the maximum possible
+ * allocation size and also the maximum number of segments per area.
+ */
+#if SIZEOF_DSA_POINTER == 4
+#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
+#else
+#define DSA_OFFSET_WIDTH 40 /* 1024 segments of size up to 1TB */
+#endif
+
+/* The maximum size of a DSM segment. */
+#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
+
/* Check if a dsa_pointer value is valid. */
#define DsaPointerIsValid(x) ((x) != InvalidDsaPointer)
@@ -88,6 +110,14 @@ typedef pg_atomic_uint64 dsa_pointer_atomic;
#define dsa_allocate0(area, size) \
dsa_allocate_extended(area, size, DSA_ALLOC_ZERO)
+/* Create dsa_area with default segment sizes */
+#define dsa_create(tranch_id) \
+ dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)
+
+/* Create dsa_area with default segment sizes in an existing share memory space */
+#define dsa_create_in_place(place, size, tranch_id, segment) \
+ dsa_create_in_place_ext(place, size, tranch_id, segment, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)
+
/*
* The type used for dsa_area handles. dsa_handle values can be shared with
* other processes, so that they can attach to them. This provides a way to
@@ -102,10 +132,12 @@ typedef dsm_handle dsa_handle;
/* Sentinel value to use for invalid dsa_handles. */
#define DSA_HANDLE_INVALID ((dsa_handle) DSM_HANDLE_INVALID)
-
-extern dsa_area *dsa_create(int tranche_id);
-extern dsa_area *dsa_create_in_place(void *place, size_t size,
- int tranche_id, dsm_segment *segment);
+extern dsa_area *dsa_create_ext(int tranche_id, size_t init_segment_size,
+ size_t max_segment_size);
+extern dsa_area *dsa_create_in_place_ext(void *place, size_t size,
+ int tranche_id, dsm_segment *segment,
+ size_t init_segment_size,
+ size_t max_segment_size);
extern dsa_area *dsa_attach(dsa_handle handle);
extern dsa_area *dsa_attach_in_place(void *place, dsm_segment *segment);
extern void dsa_release_in_place(void *place);
@@ -117,6 +149,7 @@ extern void dsa_pin(dsa_area *area);
extern void dsa_unpin(dsa_area *area);
extern void dsa_set_size_limit(dsa_area *area, size_t limit);
extern size_t dsa_minimum_size(void);
+extern size_t dsa_get_total_segment_size(dsa_area *area);
extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
On Mon, Mar 20, 2023 at 9:34 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
On Mon, Mar 20, 2023 at 9:34 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
That's an interesting idea, and the analogous behavior to aset could be
a good thing for readability and maintainability. Worth seeing if it's
workable.
I've attached a quick hack patch. It can be applied on top of v32
patches. The changes to dsa.c are straightforward since it makes the
initial and max block sizes configurable.
Good to hear -- this should probably be proposed in a separate thread for
wider visibility.
--
John Naylor
EDB: http://www.enterprisedb.com
On Tue, Mar 21, 2023 at 2:41 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Mon, Mar 20, 2023 at 9:34 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Mon, Mar 20, 2023 at 9:34 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
That's an interesting idea, and the analogous behavior to aset could be a good thing for readability and maintainability. Worth seeing if it's workable.
I've attached a quick hack patch. It can be applied on top of v32
patches. The changes to dsa.c are straightforward since it makes the
initial and max block sizes configurable.
Good to hear -- this should probably be proposed in a separate thread for wider visibility.
Agreed. I'll start a new thread for that.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Thu, Feb 16, 2023 at 11:44 PM Andres Freund <andres@anarazel.de> wrote:
We really ought to replace the tid bitmap used for bitmap heap scans. The
hashtable we use is a pretty awful data structure for it. And that's not
filled in-order, for example.
I spent some time studying tidbitmap.c, and not only does it make sense to
use a radix tree there, but since it has more complex behavior and stricter
runtime requirements, it should really be the thing driving the design and
tradeoffs, not vacuum:
- With lazy expansion and single-value leaves, the root of a radix tree can
point to a single leaf. That might get rid of the need to track TBMStatus,
since setting a single-leaf tree should be cheap.
- Fixed-size PagetableEntry's are pretty large, but the tid compression
scheme used in this thread (in addition to being complex) is not a great
fit for tidbitmap because it makes it more difficult to track per-block
metadata (see also next point). With the "combined pointer-value slots"
technique, if a page's max tid offset is 63 or less, the offsets can be
stored directly in the pointer for the exact case. The lowest bit can serve
as a tag to indicate a pointer to a single-value leaf (see the sketch after
this list). That would complicate operations like union/intersection and
tracking "needs recheck", but it would reduce memory use and node-traversal
in common cases.
- Managing lossy storage. With pure blocknumber keys, replacing exact
storage for a range of 256 pages amounts to replacing a last-level node
with a single leaf containing one lossy PagetableEntry. The leader could
iterate over the nodes, and rank the last-level nodes by how much storage
they (possibly with leaf children) are using, and come up with an optimal
lossy-conversion plan.
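To illustrate the combined pointer-value slot idea above, here is a rough
sketch; the type and function names, the choice of bit 0 as the tag, and the
offset-to-bit mapping are all assumptions for illustration, not anything
from the patches in this thread:

#include <stdint.h>
#include <stdbool.h>

/*
 * One 64-bit slot is either an embedded bitmap of offsets 1..63 (exact case,
 * bit 0 clear) or a tagged pointer to a separately allocated leaf such as a
 * PagetableEntry (bit 0 set). Offset numbers start at 1, so bit 0 is free
 * to act as the tag.
 */
typedef uint64_t rt_slot;

#define SLOT_IS_LEAF_POINTER(s) (((s) & UINT64_C(1)) != 0)

/* exact case: set/test an offset (1..63) directly in the slot */
static inline rt_slot
slot_add_offset(rt_slot s, int off)
{
    return s | (UINT64_C(1) << off);
}

static inline bool
slot_test_offset(rt_slot s, int off)
{
    return (s & (UINT64_C(1) << off)) != 0;
}

/* fallback: store a pointer to a full leaf, tagged in the lowest bit */
static inline rt_slot
slot_from_leaf(void *leaf)
{
    return (rt_slot) (uintptr_t) leaf | UINT64_C(1);    /* needs 2-byte alignment */
}

static inline void *
slot_get_leaf(rt_slot s)
{
    return (void *) (uintptr_t) (s & ~UINT64_C(1));
}

Small pages then need no allocation at all for the exact case, while pages
with higher offsets, or ranges converted to lossy storage, fall back to a
tagged pointer to a separately allocated entry.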
The above would address the points (not including better iteration and
parallel bitmap index scans) raised in
/messages/by-id/CAPsAnrn5yWsoWs8GhqwbwAJx1SeLxLntV54Biq0Z-J_E86Fnng@mail.gmail.com
Ironically, by targeting a more difficult use case, it's easier since there
is less freedom. There are many ways to beat a binary search, but fewer
good ways to improve bitmap heap scan. I'd like to put aside vacuum for
some time and try killing two birds with one stone, building upon our work
thus far.
Note: I've moved the CF entry to the next CF, and set to waiting on
author for now. Since no action is currently required from Masahiko, I've
added myself as author as well. If tackling bitmap heap scan shows promise,
we could RWF and resurrect at a later time.
--
John Naylor
EDB: http://www.enterprisedb.com
On Fri, Apr 7, 2023 at 6:55 PM John Naylor <john.naylor@enterprisedb.com> wrote:
On Thu, Feb 16, 2023 at 11:44 PM Andres Freund <andres@anarazel.de> wrote:
We really ought to replace the tid bitmap used for bitmap heap scans. The
hashtable we use is a pretty awful data structure for it. And that's not
filled in-order, for example.
I spent some time studying tidbitmap.c, and not only does it make sense to use a radix tree there, but since it has more complex behavior and stricter runtime requirements, it should really be the thing driving the design and tradeoffs, not vacuum:
- With lazy expansion and single-value leaves, the root of a radix tree can point to a single leaf. That might get rid of the need to track TBMStatus, since setting a single-leaf tree should be cheap.
Instead of introducing single-value leaves to the radix tree as
another structure, can we store pointers to PagetableEntry as values?
- Fixed-size PagetableEntry's are pretty large, but the tid compression scheme used in this thread (in addition to being complex) is not a great fit for tidbitmap because it makes it more difficult to track per-block metadata (see also next point). With the "combined pointer-value slots" technique, if a page's max tid offset is 63 or less, the offsets can be stored directly in the pointer for the exact case. The lowest bit can tag to indicate a pointer to a single-value leaf. That would complicate operations like union/intersection and tracking "needs recheck", but it would reduce memory use and node-traversal in common cases.
- Managing lossy storage. With pure blocknumber keys, replacing exact storage for a range of 256 pages amounts to replacing a last-level node with a single leaf containing one lossy PagetableEntry. The leader could iterate over the nodes, and rank the last-level nodes by how much storage they (possibly with leaf children) are using, and come up with an optimal lossy-conversion plan.
The above would address the points (not including better iteration and parallel bitmap index scans) raised in
/messages/by-id/CAPsAnrn5yWsoWs8GhqwbwAJx1SeLxLntV54Biq0Z-J_E86Fnng@mail.gmail.com
Ironically, by targeting a more difficult use case, it's easier since there is less freedom. There are many ways to beat a binary search, but fewer good ways to improve bitmap heap scan. I'd like to put aside vacuum for some time and try killing two birds with one stone, building upon our work thus far.
Note: I've moved the CF entry to the next CF, and set to waiting on author for now. Since no action is currently required from Masahiko, I've added myself as author as well. If tackling bitmap heap scan shows promise, we could RWF and resurrect at a later time.
Thanks. I'm going to continue researching the memory limitation and
try lazy path expansion until PG17 development begins.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Sat, Mar 11, 2023 at 12:26 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 10, 2023 at 11:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Fri, Mar 10, 2023 at 3:42 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
On Thu, Mar 9, 2023 at 1:51 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I've attached the new version patches. I merged improvements and fixes
I did in the v29 patch.
I haven't yet had a chance to look at those closely, since I've had to devote time to other commitments. I remember I wasn't particularly impressed that v29-0008 mixed my requested name-casing changes with a bunch of other random things. Separating those out would be an obvious way to make it easier for me to look at, whenever I can get back to this. I need to look at the iteration changes as well, in addition to testing memory measurement (thanks for the new results, they look encouraging).
Okay, I'll separate them again.
Attached new patch series. In addition to separating them again, I've
fixed a conflict with HEAD.
I've attached updated version patches to make cfbot happy. Also, I've
split the fixup patches further (from 0007, except for 0016 and 0018) to
make review easier. These patches have the prefix radix tree, tidstore,
or vacuum, indicating the part they change. The 0016 patch changes DSA
so that we can specify both the initial and max segment sizes, and 0017
makes use of that in vacuumparallel.c. I'm still researching a better
solution for the memory limitation, but this is the best solution I have
for now.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
v32-0015-vacuum-Miscellaneous-updates.patch
From 16e55ffde1cb152dc94cf38a9f6c8442b78be284 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 18:07:04 +0900
Subject: [PATCH v32 15/18] vacuum: Miscellaneous updates
fix typos, comment updates, etc.
---
doc/src/sgml/monitoring.sgml | 2 +-
src/backend/access/heap/vacuumlazy.c | 17 ++++++++---------
src/backend/commands/vacuumparallel.c | 13 +++++++------
3 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 9b64614beb..67ab9fa2bc 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7331,7 +7331,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
+ <structfield>dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
Amount of dead tuple data collected since the last index vacuum cycle.
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index be487aced6..228daad750 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -10,11 +10,10 @@
* of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
- * autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * create a TidStore with the maximum bytes that can be used by the TidStore.
- * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
- * vacuum the pages that we've pruned). This frees up the memory space dedicated
- * to storing dead TIDs.
+ * autovacuum_work_mem) memory space to keep track of dead TIDs. If the
+ * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum
+ * the pages that we've pruned). This frees up the memory space dedicated to
+ * storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -2392,7 +2391,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
TidStoreIter *iter;
- TidStoreIterResult *result;
+ TidStoreIterResult *iter_result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2417,7 +2416,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = result->blkno;
+ blkno = iter_result->blkno;
vacrel->blkno = blkno;
/*
@@ -2431,8 +2430,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
- buf, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, iter_result->offsets,
+ iter_result->num_offsets, buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index be83ceb871..8385d375db 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,11 +9,12 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the shared TidStore. We launch parallel worker processes at the start of
- * parallel index bulk-deletion and index cleanup and once all indexes are
- * processed, the parallel worker processes exit. Each time we process indexes
- * in parallel, the parallel context is re-initialized so that the same DSM can
- * be used for multiple passes of index bulk-deletion and index cleanup.
+ * the memory space for storing dead items allocated in the DSA area. We
+ * launch parallel worker processes at the start of parallel index
+ * bulk-deletion and index cleanup and once all indexes are processed, the
+ * parallel worker processes exit. Each time we process indexes in parallel,
+ * the parallel context is re-initialized so that the same DSM can be used for
+ * multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -299,7 +300,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ /* Initial size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
--
2.31.1
v32-0016-Make-initial-and-maximum-DSA-segment-size-config.patch
From bc7b41a404cc1c8050c400919836951f78456aef Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 21:59:12 +0900
Subject: [PATCH v32 16/18] Make initial and maximum DSA segment size
configurable
---
src/backend/utils/mmgr/dsa.c | 64 +++++++++++++++++-------------------
src/include/utils/dsa.h | 45 ++++++++++++++++++++++---
2 files changed, 71 insertions(+), 38 deletions(-)
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 80555aefff..b6238bf4a3 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -60,14 +60,6 @@
#include "utils/freepage.h"
#include "utils/memutils.h"
-/*
- * The size of the initial DSM segment that backs a dsa_area created by
- * dsa_create. After creating some number of segments of this size we'll
- * double this size, and so on. Larger segments may be created if necessary
- * to satisfy large requests.
- */
-#define DSA_INITIAL_SEGMENT_SIZE ((size_t) (1 * 1024 * 1024))
-
/*
* How many segments to create before we double the segment size. If this is
* low, then there is likely to be a lot of wasted space in the largest
@@ -77,17 +69,6 @@
*/
#define DSA_NUM_SEGMENTS_AT_EACH_SIZE 2
-/*
- * The number of bits used to represent the offset part of a dsa_pointer.
- * This controls the maximum size of a segment, the maximum possible
- * allocation size and also the maximum number of segments per area.
- */
-#if SIZEOF_DSA_POINTER == 4
-#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
-#else
-#define DSA_OFFSET_WIDTH 40 /* 1024 segments of size up to 1TB */
-#endif
-
/*
* The maximum number of DSM segments that an area can own, determined by
* the number of bits remaining (but capped at 1024).
@@ -98,9 +79,6 @@
/* The bitmask for extracting the offset from a dsa_pointer. */
#define DSA_OFFSET_BITMASK (((dsa_pointer) 1 << DSA_OFFSET_WIDTH) - 1)
-/* The maximum size of a DSM segment. */
-#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
-
/* Number of pages (see FPM_PAGE_SIZE) per regular superblock. */
#define DSA_PAGES_PER_SUPERBLOCK 16
@@ -319,6 +297,10 @@ typedef struct
dsa_segment_index segment_bins[DSA_NUM_SEGMENT_BINS];
/* The object pools for each size class. */
dsa_area_pool pools[DSA_NUM_SIZE_CLASSES];
+ /* initial allocation segment size */
+ size_t init_segment_size;
+ /* maximum allocation segment size */
+ size_t max_segment_size;
/* The total size of all active segments. */
size_t total_segment_size;
/* The maximum total size of backing storage we are allowed. */
@@ -413,7 +395,9 @@ static dsa_segment_map *make_new_segment(dsa_area *area, size_t requested_pages)
static dsa_area *create_internal(void *place, size_t size,
int tranche_id,
dsm_handle control_handle,
- dsm_segment *control_segment);
+ dsm_segment *control_segment,
+ size_t init_segment_size,
+ size_t max_segment_size);
static dsa_area *attach_internal(void *place, dsm_segment *segment,
dsa_handle handle);
static void check_for_freed_segments(dsa_area *area);
@@ -429,7 +413,8 @@ static void check_for_freed_segments_locked(dsa_area *area);
* we require the caller to provide one.
*/
dsa_area *
-dsa_create(int tranche_id)
+dsa_create_extended(int tranche_id, size_t init_segment_size,
+ size_t max_segment_size)
{
dsm_segment *segment;
dsa_area *area;
@@ -438,7 +423,7 @@ dsa_create(int tranche_id)
* Create the DSM segment that will hold the shared control object and the
* first segment of usable space.
*/
- segment = dsm_create(DSA_INITIAL_SEGMENT_SIZE, 0);
+ segment = dsm_create(init_segment_size, 0);
/*
* All segments backing this area are pinned, so that DSA can explicitly
@@ -450,9 +435,10 @@ dsa_create(int tranche_id)
/* Create a new DSA area with the control object in this segment. */
area = create_internal(dsm_segment_address(segment),
- DSA_INITIAL_SEGMENT_SIZE,
+ init_segment_size,
tranche_id,
- dsm_segment_handle(segment), segment);
+ dsm_segment_handle(segment), segment,
+ init_segment_size, max_segment_size);
/* Clean up when the control segment detaches. */
on_dsm_detach(segment, &dsa_on_dsm_detach_release_in_place,
@@ -478,13 +464,15 @@ dsa_create(int tranche_id)
* See dsa_create() for a note about the tranche arguments.
*/
dsa_area *
-dsa_create_in_place(void *place, size_t size,
- int tranche_id, dsm_segment *segment)
+dsa_create_in_place_extended(void *place, size_t size,
+ int tranche_id, dsm_segment *segment,
+ size_t init_segment_size, size_t max_segment_size)
{
dsa_area *area;
area = create_internal(place, size, tranche_id,
- DSM_HANDLE_INVALID, NULL);
+ DSM_HANDLE_INVALID, NULL,
+ init_segment_size, max_segment_size);
/*
* Clean up when the control segment detaches, if a containing DSM segment
@@ -1215,7 +1203,8 @@ static dsa_area *
create_internal(void *place, size_t size,
int tranche_id,
dsm_handle control_handle,
- dsm_segment *control_segment)
+ dsm_segment *control_segment,
+ size_t init_segment_size, size_t max_segment_size)
{
dsa_area_control *control;
dsa_area *area;
@@ -1225,6 +1214,11 @@ create_internal(void *place, size_t size,
size_t metadata_bytes;
int i;
+ /* Validate the initial and maximum block sizes */
+ Assert(init_segment_size >= 1024);
+ Assert(max_segment_size >= init_segment_size);
+ Assert(max_segment_size <= DSA_MAX_SEGMENT_SIZE);
+
/* Sanity check on the space we have to work in. */
if (size < dsa_minimum_size())
elog(ERROR, "dsa_area space must be at least %zu, but %zu provided",
@@ -1254,8 +1248,10 @@ create_internal(void *place, size_t size,
control->segment_header.prev = DSA_SEGMENT_INDEX_NONE;
control->segment_header.usable_pages = usable_pages;
control->segment_header.freed = false;
- control->segment_header.size = DSA_INITIAL_SEGMENT_SIZE;
+ control->segment_header.size = size;
control->handle = control_handle;
+ control->init_segment_size = init_segment_size;
+ control->max_segment_size = max_segment_size;
control->max_total_segment_size = (size_t) -1;
control->total_segment_size = size;
control->segment_handles[0] = control_handle;
@@ -2124,9 +2120,9 @@ make_new_segment(dsa_area *area, size_t requested_pages)
* move to huge pages in the future. Then we work back to the number of
* pages we can fit.
*/
- total_size = DSA_INITIAL_SEGMENT_SIZE *
+ total_size = area->control->init_segment_size *
((size_t) 1 << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE));
- total_size = Min(total_size, DSA_MAX_SEGMENT_SIZE);
+ total_size = Min(total_size, area->control->max_segment_size);
total_size = Min(total_size,
area->control->max_total_segment_size -
area->control->total_segment_size);
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 2af215484f..90b7b0d93f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -77,6 +77,28 @@ typedef pg_atomic_uint64 dsa_pointer_atomic;
/* A sentinel value for dsa_pointer used to indicate failure to allocate. */
#define InvalidDsaPointer ((dsa_pointer) 0)
+/*
+ * The default size of the initial DSM segment that backs a dsa_area created
+ * by dsa_create. After creating some number of segments of this size we'll
+ * double this size, and so on. Larger segments may be created if necessary
+ * to satisfy large requests.
+ */
+#define DSA_INITIAL_SEGMENT_SIZE ((size_t) (1 * 1024 * 1024))
+
+/*
+ * The number of bits used to represent the offset part of a dsa_pointer.
+ * This controls the maximum size of a segment, the maximum possible
+ * allocation size and also the maximum number of segments per area.
+ */
+#if SIZEOF_DSA_POINTER == 4
+#define DSA_OFFSET_WIDTH 27 /* 32 segments of size up to 128MB */
+#else
+#define DSA_OFFSET_WIDTH 40 /* 1024 segments of size up to 1TB */
+#endif
+
+/* The maximum size of a DSM segment. */
+#define DSA_MAX_SEGMENT_SIZE ((size_t) 1 << DSA_OFFSET_WIDTH)
+
/* Check if a dsa_pointer value is valid. */
#define DsaPointerIsValid(x) ((x) != InvalidDsaPointer)
@@ -88,6 +110,19 @@ typedef pg_atomic_uint64 dsa_pointer_atomic;
#define dsa_allocate0(area, size) \
dsa_allocate_extended(area, size, DSA_ALLOC_ZERO)
+/* Create dsa_area with default segment sizes */
+#define dsa_create(tranch_id) \
+ dsa_create_extended(tranch_id, DSA_INITIAL_SEGMENT_SIZE, \
+ DSA_MAX_SEGMENT_SIZE)
+
+/*
+ * Create dsa_area with default segment sizes in an existing share memory
+ * space.
+ */
+#define dsa_create_in_place(place, size, tranch_id, segment) \
+ dsa_create_in_place_extended(place, size, tranch_id, segment, \
+ DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE)
+
/*
* The type used for dsa_area handles. dsa_handle values can be shared with
* other processes, so that they can attach to them. This provides a way to
@@ -102,10 +137,12 @@ typedef dsm_handle dsa_handle;
/* Sentinel value to use for invalid dsa_handles. */
#define DSA_HANDLE_INVALID ((dsa_handle) DSM_HANDLE_INVALID)
-
-extern dsa_area *dsa_create(int tranche_id);
-extern dsa_area *dsa_create_in_place(void *place, size_t size,
- int tranche_id, dsm_segment *segment);
+extern dsa_area *dsa_create_extended(int tranche_id, size_t init_segment_size,
+ size_t max_segment_size);
+extern dsa_area *dsa_create_in_place_extended(void *place, size_t size,
+ int tranche_id, dsm_segment *segment,
+ size_t init_segment_size,
+ size_t max_segment_size);
extern dsa_area *dsa_attach(dsa_handle handle);
extern dsa_area *dsa_attach_in_place(void *place, dsm_segment *segment);
extern void dsa_release_in_place(void *place);
--
2.31.1
v32-0014-tidstore-Miscellaneous-updates.patch
From f9be0044ee6e35dd44bceca59d733ba8cdf5373e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 18:01:46 +0900
Subject: [PATCH v32 14/18] tidstore: Miscellaneous updates.
comment updates, fix typos, etc.
---
src/backend/access/common/tidstore.c | 78 +++++++++++--------
.../modules/test_tidstore/test_tidstore.c | 1 +
2 files changed, 47 insertions(+), 32 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 15b77b5bcb..9360520482 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -3,18 +3,19 @@
* tidstore.c
* Tid (ItemPointerData) storage implementation.
*
- * This module provides a in-memory data structure to store Tids (ItemPointer).
- * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
- * stored in the radix tree.
+ * TidStore is an in-memory data structure to store tids (ItemPointerData).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
+ * and stored in the radix tree.
*
* TidStore can be shared among parallel worker processes by passing DSA area
* to TidStoreCreate(). Other backends can attach to the shared TidStore by
* TidStoreAttach().
*
- * Regarding the concurrency, it basically relies on the concurrency support in
- * the radix tree, but we acquires the lock on a TidStore in some cases, for
- * example, when to reset the store and when to access the number tids in the
- * store (num_tids).
+ * Regarding the concurrency support, we use a single LWLock for the TidStore.
+ * The TidStore is exclusively locked when inserting encoded tids to the
+ * radix tree or when resetting itself. When searching on the TidStore or
+ * doing the iteration, it is not locked but the underlying radix tree is
+ * locked in shared mode.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -34,16 +35,18 @@
#include "utils/memutils.h"
/*
- * For encoding purposes, tids are represented as a pair of 64-bit key and
- * 64-bit value. First, we construct 64-bit unsigned integer by combining
- * the block number and the offset number. The number of bits used for the
- * offset number is specified by max_offsets in tidstore_create(). We are
- * frugal with the bits, because smaller keys could help keeping the radix
- * tree shallow.
+ * For encoding purposes, a tid is represented as a pair of 64-bit key and
+ * 64-bit value.
*
- * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
- * the offset number and uses the next 32 bits for the block number. That
- * is, only 41 bits are used:
+ * First, we construct a 64-bit unsigned integer by combining the block
+ * number and the offset number. The number of bits used for the offset number
+ * is specified by max_off in TidStoreCreate(). We are frugal with the bits,
+ * because smaller keys could help keeping the radix tree shallow.
+ *
+ * For example, a tid of heap on an 8kB block uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. 9 bits
+ * are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks. That is, only 41 bits are used:
*
* uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
*
@@ -52,25 +55,27 @@
* u = unused bit
* (high on the left, low on the right)
*
- * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
- * on 8kB blocks.
- *
- * The 64-bit value is the bitmap representation of the lowest 6 bits
- * (TIDSTORE_VALUE_NBITS) of the integer, and the rest 35 bits are used
- * as the key:
+ * Then, 64-bit value is the bitmap representation of the lowest 6 bits
+ * (LOWER_OFFSET_NBITS) of the integer, and 64-bit key consists of the
+ * upper 3 bits of the offset number and the block number, 35 bits in
+ * total:
*
* uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
* |----| value
- * |---------------------------------------------| key
+ * |--------------------------------------| key
*
* The maximum height of the radix tree is 5 in this case.
+ *
+ * If the number of bits required for offset numbers fits in LOWER_OFFSET_NBITS,
+ * 64-bit value is the bitmap representation of the offset number, and the
+ * 64-bit key is the block number.
*/
typedef uint64 tidkey;
typedef uint64 offsetbm;
#define LOWER_OFFSET_NBITS 6 /* log(sizeof(offsetbm), 2) */
#define LOWER_OFFSET_MASK ((1 << LOWER_OFFSET_NBITS) - 1)
-/* A magic value used to identify our TidStores. */
+/* A magic value used to identify our TidStore. */
#define TIDSTORE_MAGIC 0x826f6a10
#define RT_PREFIX local_rt
@@ -152,8 +157,10 @@ typedef struct TidStoreIter
tidkey next_tidkey;
offsetbm next_off_bitmap;
- /* output for the caller */
- TidStoreIterResult result;
+ /*
+ * output for the caller. Must be last because variable-size.
+ */
+ TidStoreIterResult output;
} TidStoreIter;
static void iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap);
@@ -205,7 +212,7 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
dp = dsa_allocate0(area, sizeof(TidStoreControl));
ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->control->max_bytes = (size_t) (max_bytes * ratio);
ts->area = area;
ts->control->magic = TIDSTORE_MAGIC;
@@ -353,7 +360,11 @@ TidStoreReset(TidStore *ts)
}
}
-/* Add Tids on a block to TidStore */
+/*
+ * Set the given tids on the blkno to TidStore.
+ *
+ * NB: the offset numbers in offsets must be sorted in ascending order.
+ */
void
TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
int num_offsets)
@@ -564,7 +575,7 @@ TidStoreEndIterate(TidStoreIter *iter)
int64
TidStoreNumTids(TidStore *ts)
{
- uint64 num_tids;
+ int64 num_tids;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -624,11 +635,14 @@ TidStoreGetHandle(TidStore *ts)
return ts->control->handle;
}
-/* Extract tids from the given key-value pair */
+/*
+ * Decode the key and offset bitmap to tids and store them to the iteration
+ * result.
+ */
static void
iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap)
{
- TidStoreIterResult *result = (&iter->result);
+ TidStoreIterResult *output = (&iter->output);
while (off_bitmap)
{
@@ -661,7 +675,7 @@ key_get_blkno(TidStore *ts, tidkey key)
static inline tidkey
encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit)
{
- uint32 offset = ItemPointerGetOffsetNumber(tid);
+ OffsetNumber offset = ItemPointerGetOffsetNumber(tid);
BlockNumber block = ItemPointerGetBlockNumber(tid);
return encode_blk_off(ts, block, offset, off_bit);
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
index 12d3027624..8659e6780e 100644
--- a/src/test/modules/test_tidstore/test_tidstore.c
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -222,6 +222,7 @@ test_tidstore(PG_FUNCTION_ARGS)
elog(NOTICE, "testing basic operations");
test_basic(MaxHeapTuplesPerPage);
test_basic(10);
+ test_basic(MaxHeapTuplesPerPage * 2);
PG_RETURN_VOID();
}
--
2.31.1
v32-0018-Revert-building-benchmark-module-for-CI.patch
From 9e42c43a7d081c06c02f0029e610c29d911732e3 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 14 Feb 2023 19:31:34 +0700
Subject: [PATCH v32 18/18] Revert building benchmark module for CI
---
contrib/meson.build | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/contrib/meson.build b/contrib/meson.build
index 421d469f8c..52253de793 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,7 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
-subdir('bench_radix_tree')
+#subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
v32-0017-tidstore-vacuum-Specify-the-init-and-max-DSA-seg.patch
From 11fda58f829c03d2a7c6476affc61862a078f741 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 22:27:04 +0900
Subject: [PATCH v32 17/18] tidstore, vacuum: Specify the init and max DSA
segment size based on m_w_m
---
src/backend/access/common/tidstore.c | 32 +++++----------------------
src/backend/commands/vacuumparallel.c | 21 ++++++++++++++----
2 files changed, 23 insertions(+), 30 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 9360520482..571d15c5c3 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -180,39 +180,15 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
ts = palloc0(sizeof(TidStore));
- /*
- * Create the radix tree for the main storage.
- *
- * Memory consumption depends on the number of stored tids, but also on the
- * distribution of them, how the radix tree stores, and the memory management
- * that backed the radix tree. The maximum bytes that a TidStore can
- * use is specified by the max_bytes in TidStoreCreate(). We want the total
- * amount of memory consumption by a TidStore not to exceed the max_bytes.
- *
- * In local TidStore cases, the radix tree uses slab allocators for each kind
- * of node class. The most memory consuming case while adding Tids associated
- * with one page (i.e. during TidStoreSetBlockOffsets()) is that we allocate a new
- * slab block for a new radix tree node, which is approximately 70kB. Therefore,
- * we deduct 70kB from the max_bytes.
- *
- * In shared cases, DSA allocates the memory segments big enough to follow
- * a geometric series that approximately doubles the total DSA size (see
- * make_new_segment() in dsa.c). We simulated the how DSA increases segment
- * size and the simulation revealed, the 75% threshold for the maximum bytes
- * perfectly works in case where the max_bytes is a power-of-2, and the 60%
- * threshold works for other cases.
- */
if (area != NULL)
{
dsa_pointer dp;
- float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
LWTRANCHE_SHARED_TIDSTORE);
dp = dsa_allocate0(area, sizeof(TidStoreControl));
ts->control = (TidStoreControl *) dsa_get_address(area, dp);
- ts->control->max_bytes = (size_t) (max_bytes * ratio);
ts->area = area;
ts->control->magic = TIDSTORE_MAGIC;
@@ -223,11 +199,15 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
else
{
ts->tree.local = local_rt_create(CurrentMemoryContext);
-
ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
- ts->control->max_bytes = max_bytes - (70 * 1024);
}
+ /*
+ * max_bytes is forced to be at least 64kB, the current minimum valid value
+ * for the work_mem GUC.
+ */
+ ts->control->max_bytes = Max(64 * 1024L, max_bytes);
+
ts->control->max_off = max_off;
ts->control->max_off_nbits = pg_ceil_log2_32(max_off);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 8385d375db..17699aa007 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -252,6 +252,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
Size est_indstats_len;
Size est_shared_len;
Size dsa_minsize = dsa_minimum_size();
+ Size init_segsize;
+ Size max_segsize;
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -367,12 +369,23 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
- /* Prepare DSA space for dead items */
+ /*
+ * Prepare DSA space for dead items.
+ *
+ * Since total DSA size grows while following a geometric series by default,
+ * we specify both the initial DSA segment and maximum DSA segment sizes
+ * based on the memory available for parallel vacuum. Typically, the initial
+ * segment size is 1MB and the maximum segment size is vac_work_mem / 8, and
+ * heap scan stops after allocating 1.125 times more memory than vac_work_mem.
+ */
+ init_segsize = Min(vac_work_mem / 4, (1024 * 1024));
+ max_segsize = Max(vac_work_mem / 8, (8 * 1024 * 1024));
area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
- dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
- LWTRANCHE_PARALLEL_VACUUM_DSA,
- pcxt->seg);
+ dead_items_dsa = dsa_create_in_place_extended(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg,
+ init_segsize, max_segsize);
dead_items = TidStoreCreate(vac_work_mem, max_offset, dead_items_dsa);
pvs->dead_items = dead_items;
pvs->dead_items_area = dead_items_dsa;
--
2.31.1
v32-0010-radix-tree-fix-radix-tree-test-code.patch
From 591bede6738ca9e5c7264db7ff1d3dd9ba29247f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 17:35:14 +0900
Subject: [PATCH v32 10/18] radix tree: fix radix tree test code
fix tests for key insertion in ascending or descending order.
Also, we missed tests for MIN and MAX size classes.
---
.../expected/test_radixtree.out | 6 +-
.../modules/test_radixtree/test_radixtree.c | 103 ++++++++++++------
2 files changed, 71 insertions(+), 38 deletions(-)
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
index ce645cb8b5..7ad1ce3605 100644
--- a/src/test/modules/test_radixtree/expected/test_radixtree.out
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -4,8 +4,10 @@ CREATE EXTENSION test_radixtree;
-- an error if something fails.
--
SELECT test_radixtree();
-NOTICE: testing basic operations with leaf node 4
-NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 3
+NOTICE: testing basic operations with inner node 3
+NOTICE: testing basic operations with leaf node 15
+NOTICE: testing basic operations with inner node 15
NOTICE: testing basic operations with leaf node 32
NOTICE: testing basic operations with inner node 32
NOTICE: testing basic operations with leaf node 125
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
index afe53382f3..5a169854d9 100644
--- a/src/test/modules/test_radixtree/test_radixtree.c
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -43,12 +43,15 @@ typedef uint64 TestValueType;
*/
static const bool rt_test_stats = false;
-static int rt_node_kind_fanouts[] = {
- 0,
- 4, /* RT_NODE_KIND_4 */
- 32, /* RT_NODE_KIND_32 */
- 125, /* RT_NODE_KIND_125 */
- 256 /* RT_NODE_KIND_256 */
+/*
+ * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
+ */
+static int rt_node_class_fanouts[] = {
+ 3, /* RT_CLASS_3 */
+ 15, /* RT_CLASS_32_MIN */
+ 32, /* RT_CLASS_32_MAX */
+ 125, /* RT_CLASS_125 */
+ 256 /* RT_CLASS_256 */
};
/*
* A struct to define a pattern of integers, for use with the test_pattern()
@@ -260,10 +263,9 @@ test_basic(int children, bool test_inner)
* Check if keys from start to end with the shift exist in the tree.
*/
static void
-check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
- int incr)
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end)
{
- for (int i = start; i < end; i++)
+ for (int i = start; i <= end; i++)
{
uint64 key = ((uint64) i << shift);
TestValueType val;
@@ -277,22 +279,26 @@ check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
}
}
+/*
+ * Insert 256 key-value pairs, and check if keys are properly inserted on each
+ * node class.
+ */
+/* Test keys [0, 256) */
+#define NODE_TYPE_TEST_KEY_MIN 0
+#define NODE_TYPE_TEST_KEY_MAX 256
static void
-test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+test_node_types_insert_asc(rt_radix_tree *radixtree, uint8 shift)
{
- uint64 num_entries;
- int ninserted = 0;
- int start = insert_asc ? 0 : 256;
- int incr = insert_asc ? 1 : -1;
- int end = insert_asc ? 256 : 0;
- int node_kind_idx = 1;
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = 0;
- for (int i = start; i != end; i += incr)
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
{
uint64 key = ((uint64) i << shift);
bool found;
- found = rt_set(radixtree, key, (TestValueType*) &key);
+ found = rt_set(radixtree, key, (TestValueType *) &key);
if (found)
elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
@@ -300,24 +306,49 @@ test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
* After filling all slots in each node type, check if the values
* are stored properly.
*/
- if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
{
- int check_start = insert_asc
- ? rt_node_kind_fanouts[node_kind_idx - 1]
- : rt_node_kind_fanouts[node_kind_idx];
- int check_end = insert_asc
- ? rt_node_kind_fanouts[node_kind_idx]
- : rt_node_kind_fanouts[node_kind_idx - 1];
-
- check_search_on_node(radixtree, shift, check_start, check_end, incr);
- node_kind_idx++;
+ check_search_on_node(radixtree, shift, key_checked, i);
+ key_checked = i;
+ node_class_idx++;
}
-
- ninserted++;
}
num_entries = rt_num_entries(radixtree);
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Similar to test_node_types_insert_asc(), but inserts keys in descending order.
+ */
+static void
+test_node_types_insert_desc(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+ int node_class_idx = 0;
+ uint64 key_checked = NODE_TYPE_TEST_KEY_MAX - 1;
+
+ for (int i = NODE_TYPE_TEST_KEY_MAX - 1; i >= NODE_TYPE_TEST_KEY_MIN; i--)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType *) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+ if ((i + 1) == rt_node_class_fanouts[node_class_idx])
+ {
+ check_search_on_node(radixtree, shift, i, key_checked);
+ key_checked = i;
+ node_class_idx++;
+ }
+ }
+
+ num_entries = rt_num_entries(radixtree);
if (num_entries != 256)
elog(ERROR,
"rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
@@ -329,7 +360,7 @@ test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
{
uint64 num_entries;
- for (int i = 0; i < 256; i++)
+ for (int i = NODE_TYPE_TEST_KEY_MIN; i < NODE_TYPE_TEST_KEY_MAX; i++)
{
uint64 key = ((uint64) i << shift);
bool found;
@@ -379,9 +410,9 @@ test_node_types(uint8 shift)
* then delete all entries to make it empty, and insert and search entries
* again.
*/
- test_node_types_insert(radixtree, shift, true);
+ test_node_types_insert_asc(radixtree, shift);
test_node_types_delete(radixtree, shift);
- test_node_types_insert(radixtree, shift, false);
+ test_node_types_insert_desc(radixtree, shift);
rt_free(radixtree);
#ifdef RT_SHMEM
@@ -664,10 +695,10 @@ test_radixtree(PG_FUNCTION_ARGS)
{
test_empty();
- for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ for (int i = 0; i < lengthof(rt_node_class_fanouts); i++)
{
- test_basic(rt_node_kind_fanouts[i], false);
- test_basic(rt_node_kind_fanouts[i], true);
+ test_basic(rt_node_class_fanouts[i], false);
+ test_basic(rt_node_class_fanouts[i], true);
}
for (int shift = 0; shift <= (64 - 8); shift += 8)
--
2.31.1
Attachment: v32-0011-tidstore-vacuum-Use-camel-case-for-TidStore-APIs.patch (application/octet-stream)
From 6f4ff3584cbbf4db3ed7268ebc360df0ad328696 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 17:47:10 +0900
Subject: [PATCH v32 11/18] tidstore, vacuum: Use camel case for TidStore APIs
---
src/backend/access/common/tidstore.c | 64 +++++++++---------
src/backend/access/heap/vacuumlazy.c | 44 ++++++------
src/backend/commands/vacuum.c | 4 +-
src/backend/commands/vacuumparallel.c | 12 ++--
src/include/access/tidstore.h | 34 +++++-----
.../modules/test_tidstore/test_tidstore.c | 67 ++++++++++---------
6 files changed, 114 insertions(+), 111 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 8c05e60d92..283a326d13 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -7,9 +7,9 @@
* Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
* stored in the radix tree.
*
- * A TidStore can be shared among parallel worker processes by passing DSA area
- * to tidstore_create(). Other backends can attach to the shared TidStore by
- * tidstore_attach().
+ * TidStore can be shared among parallel worker processes by passing DSA area
+ * to TidStoreCreate(). Other backends can attach to the shared TidStore by
+ * TidStoreAttach().
*
* Regarding the concurrency, it basically relies on the concurrency support in
* the radix tree, but we acquires the lock on a TidStore in some cases, for
@@ -106,7 +106,7 @@ typedef struct TidStoreControl
LWLock lock;
/* handles for TidStore and radix tree */
- tidstore_handle handle;
+ TidStoreHandle handle;
shared_rt_handle tree_handle;
} TidStoreControl;
@@ -164,7 +164,7 @@ static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_b
* The radix tree for storage is allocated in DSA area is 'area' is non-NULL.
*/
TidStore *
-tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
{
TidStore *ts;
@@ -176,12 +176,12 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
* Memory consumption depends on the number of stored tids, but also on the
* distribution of them, how the radix tree stores, and the memory management
* that backed the radix tree. The maximum bytes that a TidStore can
- * use is specified by the max_bytes in tidstore_create(). We want the total
+ * use is specified by the max_bytes in TidStoreCreate(). We want the total
* amount of memory consumption by a TidStore not to exceed the max_bytes.
*
* In local TidStore cases, the radix tree uses slab allocators for each kind
* of node class. The most memory consuming case while adding Tids associated
- * with one page (i.e. during tidstore_add_tids()) is that we allocate a new
+ * with one page (i.e. during TidStoreSetBlockOffsets()) is that we allocate a new
* slab block for a new radix tree node, which is approximately 70kB. Therefore,
* we deduct 70kB from the max_bytes.
*
@@ -235,7 +235,7 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
* allocated in backend-local memory using the CurrentMemoryContext.
*/
TidStore *
-tidstore_attach(dsa_area *area, tidstore_handle handle)
+TidStoreAttach(dsa_area *area, TidStoreHandle handle)
{
TidStore *ts;
dsa_pointer control;
@@ -266,7 +266,7 @@ tidstore_attach(dsa_area *area, tidstore_handle handle)
* to the operating system.
*/
void
-tidstore_detach(TidStore *ts)
+TidStoreDetach(TidStore *ts)
{
Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
@@ -279,12 +279,12 @@ tidstore_detach(TidStore *ts)
*
* TODO: The caller must be certain that no other backend will attempt to
* access the TidStore before calling this function. Other backend must
- * explicitly call tidstore_detach to free up backend-local memory associated
- * with the TidStore. The backend that calls tidstore_destroy must not call
- * tidstore_detach.
+ * explicitly call TidStoreDetach() to free up backend-local memory associated
+ * with the TidStore. The backend that calls TidStoreDestroy() must not call
+ * TidStoreDetach().
*/
void
-tidstore_destroy(TidStore *ts)
+TidStoreDestroy(TidStore *ts)
{
if (TidStoreIsShared(ts))
{
@@ -309,11 +309,11 @@ tidstore_destroy(TidStore *ts)
}
/*
- * Forget all collected Tids. It's similar to tidstore_destroy but we don't free
+ * Forget all collected Tids. It's similar to TidStoreDestroy() but we don't free
* entire TidStore but recreate only the radix tree storage.
*/
void
-tidstore_reset(TidStore *ts)
+TidStoreReset(TidStore *ts)
{
if (TidStoreIsShared(ts))
{
@@ -352,8 +352,8 @@ tidstore_reset(TidStore *ts)
/* Add Tids on a block to TidStore */
void
-tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
- int num_offsets)
+TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
{
uint64 *values;
uint64 key;
@@ -431,7 +431,7 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
/* Return true if the given tid is present in the TidStore */
bool
-tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+TidStoreIsMember(TidStore *ts, ItemPointer tid)
{
uint64 key;
uint64 val = 0;
@@ -452,14 +452,16 @@ tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
}
/*
- * Prepare to iterate through a TidStore. Since the radix tree is locked during the
- * iteration, so tidstore_end_iterate() needs to called when finished.
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, so TidStoreEndIterate() needs to be called when finished.
+ *
+ * The TidStoreIter struct is created in the caller's memory context.
*
* Concurrent updates during the iteration will be blocked when inserting a
* key-value to the radix tree.
*/
TidStoreIter *
-tidstore_begin_iterate(TidStore *ts)
+TidStoreBeginIterate(TidStore *ts)
{
TidStoreIter *iter;
@@ -477,7 +479,7 @@ tidstore_begin_iterate(TidStore *ts)
iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
/* If the TidStore is empty, there is no business */
- if (tidstore_num_tids(ts) == 0)
+ if (TidStoreNumTids(ts) == 0)
iter->finished = true;
return iter;
@@ -498,7 +500,7 @@ tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
* numbers in each result is also sorted in ascending order.
*/
TidStoreIterResult *
-tidstore_iterate_next(TidStoreIter *iter)
+TidStoreIterateNext(TidStoreIter *iter)
{
uint64 key;
uint64 val;
@@ -544,7 +546,7 @@ tidstore_iterate_next(TidStoreIter *iter)
* or when existing an iteration.
*/
void
-tidstore_end_iterate(TidStoreIter *iter)
+TidStoreEndIterate(TidStoreIter *iter)
{
if (TidStoreIsShared(iter->ts))
shared_rt_end_iterate(iter->tree_iter.shared);
@@ -557,7 +559,7 @@ tidstore_end_iterate(TidStoreIter *iter)
/* Return the number of tids we collected so far */
int64
-tidstore_num_tids(TidStore *ts)
+TidStoreNumTids(TidStore *ts)
{
uint64 num_tids;
@@ -575,16 +577,16 @@ tidstore_num_tids(TidStore *ts)
/* Return true if the current memory usage of TidStore exceeds the limit */
bool
-tidstore_is_full(TidStore *ts)
+TidStoreIsFull(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+ return (TidStoreMemoryUsage(ts) > ts->control->max_bytes);
}
/* Return the maximum memory TidStore can use */
size_t
-tidstore_max_memory(TidStore *ts)
+TidStoreMaxMemory(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -593,7 +595,7 @@ tidstore_max_memory(TidStore *ts)
/* Return the memory usage of TidStore */
size_t
-tidstore_memory_usage(TidStore *ts)
+TidStoreMemoryUsage(TidStore *ts)
{
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
@@ -611,8 +613,8 @@ tidstore_memory_usage(TidStore *ts)
/*
* Get a handle that can be used by other processes to attach to this TidStore
*/
-tidstore_handle
-tidstore_get_handle(TidStore *ts)
+TidStoreHandle
+TidStoreGetHandle(TidStore *ts)
{
Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 2c72088e69..be487aced6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -842,7 +842,7 @@ lazy_scan_heap(LVRelState *vacrel)
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
+ initprog_val[2] = TidStoreMaxMemory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -909,7 +909,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- if (tidstore_is_full(vacrel->dead_items))
+ if (TidStoreIsFull(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -1078,16 +1078,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(tidstore_num_tids(dead_items) == 0);
+ Assert(TidStoreNumTids(dead_items) == 0);
}
else if (prunestate.num_offsets > 0)
{
/* Save details of the LP_DEAD items from the page in dead_items */
- tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
- prunestate.num_offsets);
+ TidStoreSetBlockOffsets(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
- tidstore_memory_usage(dead_items));
+ TidStoreMemoryUsage(dead_items));
}
/*
@@ -1258,7 +1258,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (tidstore_num_tids(dead_items) > 0)
+ if (TidStoreNumTids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -2125,10 +2125,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
+ TidStoreSetBlockOffsets(dead_items, blkno, deadoffsets, lpdead_items);
pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
- tidstore_memory_usage(dead_items));
+ TidStoreMemoryUsage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2177,7 +2177,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- tidstore_reset(vacrel->dead_items);
+ TidStoreReset(vacrel->dead_items);
return;
}
@@ -2206,7 +2206,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
+ Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2234,7 +2234,7 @@ lazy_vacuum(LVRelState *vacrel)
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
bypass = (vacrel->lpdead_item_pages < threshold) &&
- tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
+ TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2279,7 +2279,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- tidstore_reset(vacrel->dead_items);
+ TidStoreReset(vacrel->dead_items);
}
/*
@@ -2352,7 +2352,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
+ TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || VacuumFailsafeActive);
/*
@@ -2407,8 +2407,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- iter = tidstore_begin_iterate(vacrel->dead_items);
- while ((result = tidstore_iterate_next(iter)) != NULL)
+ iter = TidStoreBeginIterate(vacrel->dead_items);
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2442,7 +2442,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
- tidstore_end_iterate(iter);
+ TidStoreEndIterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2453,12 +2453,12 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* the second heap pass. No more, no less.
*/
Assert(vacrel->num_index_scans > 1 ||
- (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
+ (TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
- vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, TidStoreNumTids(vacrel->dead_items),
vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
@@ -3125,8 +3125,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
- NULL);
+ vacrel->dead_items = TidStoreCreate(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index f3922b72dc..84f71fb14a 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2486,7 +2486,7 @@ vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
ereport(ivinfo->message_level,
(errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- tidstore_num_tids(dead_items))));
+ TidStoreNumTids(dead_items))));
return istat;
}
@@ -2527,5 +2527,5 @@ vac_tid_reaped(ItemPointer itemptr, void *state)
{
TidStore *dead_items = (TidStore *) state;
- return tidstore_lookup_tid(dead_items, itemptr);
+ return TidStoreIsMember(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index c363f45e32..be83ceb871 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -110,7 +110,7 @@ typedef struct PVShared
pg_atomic_uint32 idx;
/* Handle of the shared TidStore */
- tidstore_handle dead_items_handle;
+ TidStoreHandle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -372,7 +372,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
LWTRANCHE_PARALLEL_VACUUM_DSA,
pcxt->seg);
- dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ dead_items = TidStoreCreate(vac_work_mem, max_offset, dead_items_dsa);
pvs->dead_items = dead_items;
pvs->dead_items_area = dead_items_dsa;
@@ -385,7 +385,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
- shared->dead_items_handle = tidstore_get_handle(dead_items);
+ shared->dead_items_handle = TidStoreGetHandle(dead_items);
/* Use the same buffer size for all workers */
shared->ring_nbuffers = GetAccessStrategyBufferCount(bstrategy);
@@ -454,7 +454,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
- tidstore_destroy(pvs->dead_items);
+ TidStoreDestroy(pvs->dead_items);
dsa_detach(pvs->dead_items_area);
DestroyParallelContext(pvs->pcxt);
@@ -1013,7 +1013,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Set dead items */
area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
dead_items_area = dsa_attach_in_place(area_space, seg);
- dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
+ dead_items = TidStoreAttach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumUpdateCosts();
@@ -1061,7 +1061,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
- tidstore_detach(pvs.dead_items);
+ TidStoreDetach(dead_items);
dsa_detach(dead_items_area);
/* Pop the error context stack */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index a35a52124a..f0a432d0da 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -17,7 +17,7 @@
#include "storage/itemptr.h"
#include "utils/dsa.h"
-typedef dsa_pointer tidstore_handle;
+typedef dsa_pointer TidStoreHandle;
typedef struct TidStore TidStore;
typedef struct TidStoreIter TidStoreIter;
@@ -29,21 +29,21 @@ typedef struct TidStoreIterResult
int num_offsets;
} TidStoreIterResult;
-extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
-extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
-extern void tidstore_detach(TidStore *ts);
-extern void tidstore_destroy(TidStore *ts);
-extern void tidstore_reset(TidStore *ts);
-extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
- int num_offsets);
-extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
-extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
-extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
-extern void tidstore_end_iterate(TidStoreIter *iter);
-extern int64 tidstore_num_tids(TidStore *ts);
-extern bool tidstore_is_full(TidStore *ts);
-extern size_t tidstore_max_memory(TidStore *ts);
-extern size_t tidstore_memory_usage(TidStore *ts);
-extern tidstore_handle tidstore_get_handle(TidStore *ts);
+extern TidStore *TidStoreCreate(size_t max_bytes, int max_off, dsa_area *dsa);
+extern TidStore *TidStoreAttach(dsa_area *dsa, dsa_pointer handle);
+extern void TidStoreDetach(TidStore *ts);
+extern void TidStoreDestroy(TidStore *ts);
+extern void TidStoreReset(TidStore *ts);
+extern void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool TidStoreIsMember(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * TidStoreBeginIterate(TidStore *ts);
+extern TidStoreIterResult *TidStoreIterateNext(TidStoreIter *iter);
+extern void TidStoreEndIterate(TidStoreIter *iter);
+extern int64 TidStoreNumTids(TidStore *ts);
+extern bool TidStoreIsFull(TidStore *ts);
+extern size_t TidStoreMaxMemory(TidStore *ts);
+extern size_t TidStoreMemoryUsage(TidStore *ts);
+extern TidStoreHandle TidStoreGetHandle(TidStore *ts);
#endif /* TIDSTORE_H */
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
index 9a1217f833..12d3027624 100644
--- a/src/test/modules/test_tidstore/test_tidstore.c
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -37,10 +37,10 @@ check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
ItemPointerSet(&tid, blkno, off);
- found = tidstore_lookup_tid(ts, &tid);
+ found = TidStoreIsMember(ts, &tid);
if (found != expect)
- elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ elog(ERROR, "TidStoreIsMember for TID (%u, %u) returned %d, expected %d",
blkno, off, found, expect);
}
@@ -69,9 +69,9 @@ test_basic(int max_offset)
LWLockRegisterTranche(tranche_id, "test_tidstore");
dsa = dsa_create(tranche_id);
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
#else
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
#endif
/* prepare the offset array */
@@ -83,7 +83,7 @@ test_basic(int max_offset)
/* add tids */
for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
- tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+ TidStoreSetBlockOffsets(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
/* lookup test */
for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
@@ -105,30 +105,30 @@ test_basic(int max_offset)
}
/* test the number of tids */
- if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
- elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
- tidstore_num_tids(ts),
+ if (TidStoreNumTids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "TidStoreNumTids returned " UINT64_FORMAT ", expected %d",
+ TidStoreNumTids(ts),
TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
/* iteration test */
- iter = tidstore_begin_iterate(ts);
+ iter = TidStoreBeginIterate(ts);
blk_idx = 0;
- while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ while ((iter_result = TidStoreIterateNext(iter)) != NULL)
{
/* check the returned block number */
if (blks_sorted[blk_idx] != iter_result->blkno)
- elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ elog(ERROR, "TidStoreIterateNext returned block number %u, expected %u",
iter_result->blkno, blks_sorted[blk_idx]);
/* check the returned offset numbers */
if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
- elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ elog(ERROR, "TidStoreIterateNext %u offsets, expected %u",
iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
for (int i = 0; i < iter_result->num_offsets; i++)
{
if (offs[i] != iter_result->offsets[i])
- elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ elog(ERROR, "TidStoreIterateNext offset number %u on block %u, expected %u",
iter_result->offsets[i], iter_result->blkno, offs[i]);
}
@@ -136,15 +136,15 @@ test_basic(int max_offset)
}
if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
- elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ elog(ERROR, "TidStoreIterateNext returned %d blocks, expected %d",
blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
/* remove all tids */
- tidstore_reset(ts);
+ TidStoreReset(ts);
/* test the number of tids */
- if (tidstore_num_tids(ts) != 0)
- elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
/* lookup test for empty store */
for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
@@ -156,7 +156,7 @@ test_basic(int max_offset)
check_tid(ts, MaxBlockNumber, off, false);
}
- tidstore_destroy(ts);
+ TidStoreDestroy(ts);
#ifdef TEST_SHARED_TIDSTORE
dsa_detach(dsa);
@@ -177,36 +177,37 @@ test_empty(void)
LWLockRegisterTranche(tranche_id, "test_tidstore");
dsa = dsa_create(tranche_id);
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
#else
- ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+ ts = TidStoreCreate(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
#endif
elog(NOTICE, "testing empty tidstore");
ItemPointerSet(&tid, 0, FirstOffsetNumber);
- if (tidstore_lookup_tid(ts, &tid))
- elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
+ 0, FirstOffsetNumber);
ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
- if (tidstore_lookup_tid(ts, &tid))
- elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ if (TidStoreIsMember(ts, &tid))
+ elog(ERROR, "TidStoreIsMember for TID (%u,%u) on empty store returned true",
MaxBlockNumber, MaxOffsetNumber);
- if (tidstore_num_tids(ts) != 0)
- elog(ERROR, "tidstore_num_entries on empty store returned non-zero");
+ if (TidStoreNumTids(ts) != 0)
+ elog(ERROR, "TidStoreNumTids on empty store returned non-zero");
- if (tidstore_is_full(ts))
- elog(ERROR, "tidstore_is_full on empty store returned true");
+ if (TidStoreIsFull(ts))
+ elog(ERROR, "TidStoreIsFull on empty store returned true");
- iter = tidstore_begin_iterate(ts);
+ iter = TidStoreBeginIterate(ts);
- if (tidstore_iterate_next(iter) != NULL)
- elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+ if (TidStoreIterateNext(iter) != NULL)
+ elog(ERROR, "TidStoreIterateNext on empty store returned TIDs");
- tidstore_end_iterate(iter);
+ TidStoreEndIterate(iter);
- tidstore_destroy(ts);
+ TidStoreDestroy(ts);
#ifdef TEST_SHARED_TIDSTORE
dsa_detach(dsa);
--
2.31.1
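For readers skimming the renamed API above, here is a minimal, self-contained sketch (not part of the patch) of how the CamelCase TidStore functions fit together, loosely mirroring the vacuum usage in this patch. The memory limit, block number, and offset values are made up purely for illustration.

#include "postgres.h"
#include "access/htup_details.h"	/* MaxHeapTuplesPerPage */
#include "access/tidstore.h"

static void
tidstore_usage_sketch(void)
{
	TidStore   *ts;
	TidStoreIter *iter;
	TidStoreIterResult *result;
	OffsetNumber offs[] = {1, 2, 5};	/* made-up dead offsets */
	ItemPointerData tid;

	/* Backend-local store; pass a dsa_area instead of NULL to share it */
	ts = TidStoreCreate(64 * 1024 * 1024, MaxHeapTuplesPerPage, NULL);

	/* Record the dead offsets collected for block 10 */
	TidStoreSetBlockOffsets(ts, 10, offs, lengthof(offs));

	/* Membership check, as vac_tid_reaped() does for each index tuple */
	ItemPointerSet(&tid, 10, 2);
	if (!TidStoreIsMember(ts, &tid))
		elog(ERROR, "expected TID (10,2) to be found");

	/* Iterate block by block, as the second heap pass does */
	iter = TidStoreBeginIterate(ts);
	while ((result = TidStoreIterateNext(iter)) != NULL)
		elog(DEBUG1, "block %u has %d dead offsets",
			 result->blkno, result->num_offsets);
	TidStoreEndIterate(iter);

	TidStoreDestroy(ts);
}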
Attachment: v32-0012-tidstore-Use-concept-of-off_upper-and-off_lower.patch (application/octet-stream)
From 3f38c7722deb260e5cc4ac003ab37cfe959b1954 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 17:54:49 +0900
Subject: [PATCH v32 12/18] tidstore: Use concept of off_upper and off_lower.
The key is the block number combined with the upper bits of the offset
number, whereas the value is a bitmap over the lower bits of the offset
number. Function and variable names are updated accordingly.
---
src/backend/access/common/tidstore.c | 191 ++++++++++++++-------------
1 file changed, 99 insertions(+), 92 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 283a326d13..d9fe3d5f15 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -65,8 +65,10 @@
*
* The maximum height of the radix tree is 5 in this case.
*/
-#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
-#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
+typedef uint64 tidkey;
+typedef uint64 offsetbm;
+#define LOWER_OFFSET_NBITS 6 /* log(sizeof(offsetbm), 2) */
+#define LOWER_OFFSET_MASK ((1 << LOWER_OFFSET_NBITS) - 1)
/* A magic value used to identify our TidStores. */
#define TIDSTORE_MAGIC 0x826f6a10
@@ -75,7 +77,7 @@
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
+#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
#define RT_PREFIX shared_rt
@@ -83,7 +85,7 @@
#define RT_SCOPE static
#define RT_DECLARE
#define RT_DEFINE
-#define RT_VALUE_TYPE uint64
+#define RT_VALUE_TYPE tidkey
#include "lib/radixtree.h"
/* The control object for a TidStore */
@@ -94,10 +96,10 @@ typedef struct TidStoreControl
/* These values are never changed after creation */
size_t max_bytes; /* the maximum bytes a TidStore can use */
- int max_offset; /* the maximum offset number */
- int offset_nbits; /* the number of bits required for an offset
- * number */
- int offset_key_nbits; /* the number of bits of an offset number
+ int max_off; /* the maximum offset number */
+ int max_off_nbits; /* the number of bits required for offset
+ * numbers */
+ int upper_off_nbits; /* the number of bits of offset numbers
* used in a key */
/* The below fields are used only in shared case */
@@ -147,17 +149,18 @@ typedef struct TidStoreIter
bool finished;
/* save for the next iteration */
- uint64 next_key;
- uint64 next_val;
+ tidkey next_tidkey;
+ offsetbm next_off_bitmap;
/* output for the caller */
TidStoreIterResult result;
} TidStoreIter;
-static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
-static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
-static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
-static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
+static void iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap);
+static inline BlockNumber key_get_blkno(TidStore *ts, tidkey key);
+static inline tidkey encode_blk_off(TidStore *ts, BlockNumber block,
+ OffsetNumber offset, offsetbm *off_bit);
+static inline tidkey encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit);
/*
* Create a TidStore. The returned object is allocated in backend-local memory.
@@ -218,14 +221,14 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
ts->control->max_bytes = max_bytes - (70 * 1024);
}
- ts->control->max_offset = max_offset;
- ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+ ts->control->max_off = max_off;
+ ts->control->max_off_nbits = pg_ceil_log2_32(max_off);
- if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
- ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+ if (ts->control->max_off_nbits < LOWER_OFFSET_NBITS)
+ ts->control->max_off_nbits = LOWER_OFFSET_NBITS;
- ts->control->offset_key_nbits =
- ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+ ts->control->upper_off_nbits =
+ ts->control->max_off_nbits - LOWER_OFFSET_NBITS;
return ts;
}
@@ -355,25 +358,25 @@ void
TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
int num_offsets)
{
- uint64 *values;
- uint64 key;
- uint64 prev_key;
- uint64 off_bitmap = 0;
+ offsetbm *bitmaps;
+ tidkey key;
+ tidkey prev_key;
+ offsetbm off_bitmap = 0;
int idx;
- const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
- const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+ const tidkey key_base = ((uint64) blkno) << ts->control->upper_off_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->upper_off_nbits;
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- values = palloc(sizeof(uint64) * nkeys);
+ bitmaps = palloc(sizeof(offsetbm) * nkeys);
key = prev_key = key_base;
for (int i = 0; i < num_offsets; i++)
{
- uint64 off_bit;
+ offsetbm off_bit;
/* encode the tid to a key and partial offset */
- key = encode_key_off(ts, blkno, offsets[i], &off_bit);
+ key = encode_blk_off(ts, blkno, offsets[i], &off_bit);
/* make sure we scanned the line pointer array in order */
Assert(key >= prev_key);
@@ -384,11 +387,11 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
Assert(idx >= 0 && idx < nkeys);
/* write out offset bitmap for this key */
- values[idx] = off_bitmap;
+ bitmaps[idx] = off_bitmap;
/* zero out any gaps up to the current key */
for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
- values[empty_idx] = 0;
+ bitmaps[empty_idx] = 0;
/* reset for current key -- the current offset will be handled below */
off_bitmap = 0;
@@ -401,7 +404,7 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
/* save the final index for later */
idx = key - key_base;
/* write out last offset bitmap */
- values[idx] = off_bitmap;
+ bitmaps[idx] = off_bitmap;
if (TidStoreIsShared(ts))
LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
@@ -409,14 +412,14 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
/* insert the calculated key-values to the tree */
for (int i = 0; i <= idx; i++)
{
- if (values[i])
+ if (bitmaps[i])
{
key = key_base + i;
if (TidStoreIsShared(ts))
- shared_rt_set(ts->tree.shared, key, &values[i]);
+ shared_rt_set(ts->tree.shared, key, &bitmaps[i]);
else
- local_rt_set(ts->tree.local, key, &values[i]);
+ local_rt_set(ts->tree.local, key, &bitmaps[i]);
}
}
@@ -426,29 +429,29 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
if (TidStoreIsShared(ts))
LWLockRelease(&ts->control->lock);
- pfree(values);
+ pfree(bitmaps);
}
/* Return true if the given tid is present in the TidStore */
bool
TidStoreIsMember(TidStore *ts, ItemPointer tid)
{
- uint64 key;
- uint64 val = 0;
- uint64 off_bit;
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ offsetbm off_bit;
bool found;
- key = tid_to_key_off(ts, tid, &off_bit);
+ key = encode_tid(ts, tid, &off_bit);
if (TidStoreIsShared(ts))
- found = shared_rt_search(ts->tree.shared, key, &val);
+ found = shared_rt_search(ts->tree.shared, key, &off_bitmap);
else
- found = local_rt_search(ts->tree.local, key, &val);
+ found = local_rt_search(ts->tree.local, key, &off_bitmap);
if (!found)
return false;
- return (val & off_bit) != 0;
+ return (off_bitmap & off_bit) != 0;
}
/*
@@ -486,12 +489,12 @@ TidStoreBeginIterate(TidStore *ts)
}
static inline bool
-tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+tidstore_iter(TidStoreIter *iter, tidkey *key, offsetbm *off_bitmap)
{
if (TidStoreIsShared(iter->ts))
- return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, off_bitmap);
- return local_rt_iterate_next(iter->tree_iter.local, key, val);
+ return local_rt_iterate_next(iter->tree_iter.local, key, off_bitmap);
}
/*
@@ -502,43 +505,46 @@ tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
TidStoreIterResult *
TidStoreIterateNext(TidStoreIter *iter)
{
- uint64 key;
- uint64 val;
- TidStoreIterResult *result = &(iter->result);
+ tidkey key;
+ offsetbm off_bitmap = 0;
+ TidStoreIterResult *output = &(iter->output);
if (iter->finished)
return NULL;
- if (BlockNumberIsValid(result->blkno))
- {
- /* Process the previously collected key-value */
- result->num_offsets = 0;
- tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
- }
+ /* Initialize the outputs */
+ output->blkno = InvalidBlockNumber;
+ output->num_offsets = 0;
- while (tidstore_iter_kv(iter, &key, &val))
- {
- BlockNumber blkno;
+ /*
+ * Decode the key and offset bitmap that are collected in the previous
+ * time, if exists.
+ */
+ if (iter->next_off_bitmap > 0)
+ iter_decode_key_off(iter, iter->next_tidkey, iter->next_off_bitmap);
- blkno = key_get_blkno(iter->ts, key);
+ while (tidstore_iter(iter, &key, &off_bitmap))
+ {
+ BlockNumber blkno = key_get_blkno(iter->ts, key);
- if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ if (BlockNumberIsValid(output->blkno) && output->blkno != blkno)
{
/*
- * We got a key-value pair for a different block. So return the
- * collected tids, and remember the key-value for the next iteration.
+ * We got tids for a different block. We return the collected
+ * tids so far, and remember the key-value for the next
+ * iteration.
*/
- iter->next_key = key;
- iter->next_val = val;
- return result;
+ iter->next_tidkey = key;
+ iter->next_off_bitmap = off_bitmap;
+ return output;
}
- /* Collect tids extracted from the key-value pair */
- tidstore_iter_extract_tids(iter, key, val);
+ /* Collect tids decoded from the key and offset bitmap */
+ iter_decode_key_off(iter, key, off_bitmap);
}
iter->finished = true;
- return result;
+ return output;
}
/*
@@ -623,61 +629,62 @@ TidStoreGetHandle(TidStore *ts)
/* Extract tids from the given key-value pair */
static void
-tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+iter_decode_key_off(TidStoreIter *iter, tidkey key, offsetbm off_bitmap)
{
TidStoreIterResult *result = (&iter->result);
- while (val)
+ while (off_bitmap)
{
- uint64 tid_i;
+ uint64 compressed_tid;
OffsetNumber off;
- tid_i = key << TIDSTORE_VALUE_NBITS;
- tid_i |= pg_rightmost_one_pos64(val);
+ compressed_tid = key << LOWER_OFFSET_NBITS;
+ compressed_tid |= pg_rightmost_one_pos64(off_bitmap);
- off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+ off = compressed_tid & ((UINT64CONST(1) << iter->ts->control->max_off_nbits) - 1);
- Assert(result->num_offsets < iter->ts->control->max_offset);
- result->offsets[result->num_offsets++] = off;
+ Assert(output->num_offsets < iter->ts->control->max_off);
+ output->offsets[output->num_offsets++] = off;
/* unset the rightmost bit */
- val &= ~pg_rightmost_one64(val);
+ off_bitmap &= ~pg_rightmost_one64(off_bitmap);
}
- result->blkno = key_get_blkno(iter->ts, key);
+ output->blkno = key_get_blkno(iter->ts, key);
}
/* Get block number from the given key */
static inline BlockNumber
-key_get_blkno(TidStore *ts, uint64 key)
+key_get_blkno(TidStore *ts, tidkey key)
{
- return (BlockNumber) (key >> ts->control->offset_key_nbits);
+ return (BlockNumber) (key >> ts->control->upper_off_nbits);
}
-/* Encode a tid to key and offset */
-static inline uint64
-tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
+/* Encode a tid to key and partial offset */
+static inline tidkey
+encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit)
{
uint32 offset = ItemPointerGetOffsetNumber(tid);
BlockNumber block = ItemPointerGetBlockNumber(tid);
- return encode_key_off(ts, block, offset, off_bit);
+ return encode_blk_off(ts, block, offset, off_bit);
}
/* encode a block and offset to a key and partial offset */
-static inline uint64
-encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
+static inline tidkey
+encode_blk_off(TidStore *ts, BlockNumber block, OffsetNumber offset,
+ offsetbm *off_bit)
{
- uint64 key;
- uint64 tid_i;
+ tidkey key;
+ uint64 compressed_tid;
uint32 off_lower;
- off_lower = offset & TIDSTORE_OFFSET_MASK;
- Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
+ off_lower = offset & LOWER_OFFSET_MASK;
+ Assert(off_lower < (sizeof(offsetbm) * BITS_PER_BYTE));
*off_bit = UINT64CONST(1) << off_lower;
- tid_i = offset | ((uint64) block << ts->control->offset_nbits);
- key = tid_i >> TIDSTORE_VALUE_NBITS;
+ compressed_tid = offset | ((uint64) block << ts->control->max_off_nbits);
+ key = compressed_tid >> LOWER_OFFSET_NBITS;
return key;
}
--
2.31.1
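To make the off_upper/off_lower split in the patch above concrete, here is a standalone sketch of the encoding with the widths hard-coded for 8kB heap pages (MaxHeapTuplesPerPage = 291, so 9 bits of offset, of which the lower 6 select a bit in the 64-bit bitmap). In the patch itself these widths are derived at TidStoreCreate() time from max_off; the constants and numbers below are for illustration only.

#include <stdint.h>
#include <stdio.h>

#define LOWER_OFFSET_NBITS	6	/* one uint64 bitmap covers 64 offsets */
#define MAX_OFF_NBITS		9	/* pg_ceil_log2_32(291) for 8kB pages */
#define UPPER_OFF_NBITS		(MAX_OFF_NBITS - LOWER_OFFSET_NBITS)

/* Key carries the block number plus the upper offset bits; *off_bit selects the lower bits */
static uint64_t
encode_blk_off_sketch(uint32_t block, uint16_t offset, uint64_t *off_bit)
{
	uint64_t	compressed_tid = offset | ((uint64_t) block << MAX_OFF_NBITS);

	*off_bit = UINT64_C(1) << (offset & ((1 << LOWER_OFFSET_NBITS) - 1));
	return compressed_tid >> LOWER_OFFSET_NBITS;
}

int
main(void)
{
	uint64_t	bit;
	uint64_t	key = encode_blk_off_sketch(10, 130, &bit);

	/* (block 10, offset 130) -> key 82 = (10 << 3) | (130 >> 6), bit = 1 << (130 % 64) */
	printf("key=%llu block=%llu upper=%llu lower-bit=%d\n",
		   (unsigned long long) key,
		   (unsigned long long) (key >> UPPER_OFF_NBITS),
		   (unsigned long long) (key & ((1 << UPPER_OFF_NBITS) - 1)),
		   130 % 64);
	return 0;
}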
Attachment: v32-0009-radix-tree-Review-tree-iteration-code.patch (application/octet-stream)
From 989dd2cb442c1c2a6182bb5f7785c52f4d5cdb5e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 17:33:21 +0900
Subject: [PATCH v32 09/18] radix tree: Review tree iteration code
Clean up the routines and improve comments and variable names.
---
src/include/lib/radixtree.h | 152 ++++++++++++++------------
src/include/lib/radixtree_iter_impl.h | 85 +++++++-------
2 files changed, 118 insertions(+), 119 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 088d1dfd9d..8bea606c62 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -83,7 +83,7 @@
* RT_SET - Set a key-value pair
* RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
* RT_ITERATE_NEXT - Return next key-value pair, if any
- * RT_END_ITER - End iteration
+ * RT_END_ITERATE - End iteration
* RT_MEMORY_USAGE - Get the memory usage
*
* Interface for Shared Memory
@@ -191,7 +191,7 @@
#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
-#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_SET_NODE_FROM RT_MAKE_NAME(iter_set_node_from)
#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
@@ -650,36 +650,40 @@ typedef struct RT_RADIX_TREE
* Iteration support.
*
* Iterating the radix tree returns each pair of key and value in the ascending
- * order of the key. To support this, the we iterate nodes of each level.
+ * order of the key.
*
- * RT_NODE_ITER struct is used to track the iteration within a node.
+ * RT_NODE_ITER is the struct for iteration of one radix tree node.
*
* RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
- * in order to track the iteration of each level. During iteration, we also
- * construct the key whenever updating the node iteration information, e.g., when
- * advancing the current index within the node or when moving to the next node
- * at the same level.
- *
- * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
- * has the local pointers to nodes, rather than RT_PTR_ALLOC.
- * We need either a safeguard to disallow other processes to begin the iteration
- * while one process is doing or to allow multiple processes to do the iteration.
+ * for each level to track the iteration within the node.
*/
typedef struct RT_NODE_ITER
{
- RT_PTR_LOCAL node; /* current node being iterated */
- int current_idx; /* current position. -1 for initial value */
+ /*
+ * Local pointer to the node we are iterating over.
+ *
+ * Since the radix tree doesn't support the shared iteration among multiple
+ * processes, we use RT_PTR_LOCAL rather than RT_PTR_ALLOC.
+ */
+ RT_PTR_LOCAL node;
+
+ /*
+ * The next index of the chunk array in RT_NODE_KIND_3 and
+ * RT_NODE_KIND_32 nodes, or the next chunk in RT_NODE_KIND_125 and
+ * RT_NODE_KIND_256 nodes. 0 for the initial value.
+ */
+ int idx;
} RT_NODE_ITER;
typedef struct RT_ITER
{
RT_RADIX_TREE *tree;
- /* Track the iteration on nodes of each level */
- RT_NODE_ITER stack[RT_MAX_LEVEL];
- int stack_len;
+ /* Track the nodes for each level. level = 0 is for a leaf node */
+ RT_NODE_ITER node_iters[RT_MAX_LEVEL];
+ int top_level;
- /* The key is constructed during iteration */
+ /* The key constructed during the iteration */
uint64 key;
} RT_ITER;
@@ -1804,16 +1808,9 @@ RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
}
#endif
-static inline void
-RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
-{
- iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
- iter->key |= (((uint64) chunk) << shift);
-}
-
/*
- * Advance the slot in the inner node. Return the child if exists, otherwise
- * null.
+ * Scan the inner node and return the next child node if exist, otherwise
+ * return NULL.
*/
static inline RT_PTR_LOCAL
RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
@@ -1824,8 +1821,8 @@ RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
}
/*
- * Advance the slot in the leaf node. On success, return true and the value
- * is set to value_p, otherwise return false.
+ * Scan the leaf node, and return true and the next value is set to value_p
+ * if exists. Otherwise return false.
*/
static inline bool
RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
@@ -1837,29 +1834,50 @@ RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
}
/*
- * Update each node_iter for inner nodes in the iterator node stack.
+ * While descending the radix tree from the 'from' node to the bottom, we
+ * set the next node to iterate for each level.
*/
static void
-RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+RT_ITER_SET_NODE_FROM(RT_ITER *iter, RT_PTR_LOCAL from)
{
- int level = from;
- RT_PTR_LOCAL node = from_node;
+ int level = from->shift / RT_NODE_SPAN;
+ RT_PTR_LOCAL node = from;
for (;;)
{
- RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+ RT_NODE_ITER *node_iter = &(iter->node_iters[level--]);
+
+#ifdef USE_ASSERT_CHECKING
+ if (node_iter->node)
+ {
+ /* We must have finished the iteration on the previous node */
+ if (RT_NODE_IS_LEAF(node_iter->node))
+ {
+ uint64 dummy;
+ Assert(!RT_NODE_LEAF_ITERATE_NEXT(iter, node_iter, &dummy));
+ }
+ else
+ Assert(!RT_NODE_INNER_ITERATE_NEXT(iter, node_iter));
+ }
+#endif
+ /* Set the node to the node iterator of this level */
node_iter->node = node;
- node_iter->current_idx = -1;
+ node_iter->idx = 0;
- /* We don't advance the leaf node iterator here */
if (RT_NODE_IS_LEAF(node))
- return;
+ {
+ /* We will visit the leaf node when RT_ITERATE_NEXT() */
+ break;
+ }
- /* Advance to the next slot in the inner node */
+ /*
+ * Get the first child node from the node, which corresponds to the
+ * lowest chunk within the node.
+ */
node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
- /* We must find the first children in the node */
+ /* The first child must be found */
Assert(node);
}
}
@@ -1873,14 +1891,11 @@ RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
RT_SCOPE RT_ITER *
RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
{
- MemoryContext old_ctx;
RT_ITER *iter;
RT_PTR_LOCAL root;
- int top_level;
- old_ctx = MemoryContextSwitchTo(tree->context);
-
- iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter = (RT_ITER *) MemoryContextAllocZero(tree->context,
+ sizeof(RT_ITER));
iter->tree = tree;
RT_LOCK_SHARED(tree);
@@ -1890,16 +1905,13 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
return iter;
root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
- top_level = root->shift / RT_NODE_SPAN;
- iter->stack_len = top_level;
+ iter->top_level = root->shift / RT_NODE_SPAN;
/*
- * Descend to the left most leaf node from the root. The key is being
- * constructed while descending to the leaf.
+ * Set the next node to iterate for each level from the level of the
+ * root node.
*/
- RT_UPDATE_ITER_STACK(iter, root, top_level);
-
- MemoryContextSwitchTo(old_ctx);
+ RT_ITER_SET_NODE_FROM(iter, root);
return iter;
}
@@ -1911,6 +1923,8 @@ RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
RT_SCOPE bool
RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
{
+ Assert(value_p != NULL);
+
/* Empty tree */
if (!iter->tree->ctl->root)
return false;
@@ -1918,43 +1932,38 @@ RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
for (;;)
{
RT_PTR_LOCAL child = NULL;
- RT_VALUE_TYPE value;
- int level;
- bool found;
-
- /* Advance the leaf node iterator to get next key-value pair */
- found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
- if (found)
+ /* Get the next chunk of the leaf node */
+ if (RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->node_iters[0]), value_p))
{
*key_p = iter->key;
- *value_p = value;
return true;
}
/*
- * We've visited all values in the leaf node, so advance inner node
- * iterators from the level=1 until we find the next child node.
+ * We've visited all values in the leaf node, so advance all inner node
+ * iterators by visiting inner nodes from the level = 1 until we find the
+ * next inner node that has a child node.
*/
- for (level = 1; level <= iter->stack_len; level++)
+ for (int level = 1; level <= iter->top_level; level++)
{
- child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->node_iters[level]));
if (child)
break;
}
- /* the iteration finished */
+ /* We've visited all nodes, so the iteration finished */
if (!child)
- return false;
+ break;
/*
- * Set the node to the node iterator and update the iterator stack
- * from this node.
+ * Found the new child node. We update the next node to iterate for each
+ * level from the level of this child node.
*/
- RT_UPDATE_ITER_STACK(iter, child, level - 1);
+ RT_ITER_SET_NODE_FROM(iter, child);
- /* Node iterators are updated, so try again from the leaf */
+ /* Find key-value from the leaf node again */
}
return false;
@@ -2508,8 +2517,7 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_NODE_INSERT_LEAF
#undef RT_NODE_INNER_ITERATE_NEXT
#undef RT_NODE_LEAF_ITERATE_NEXT
-#undef RT_UPDATE_ITER_STACK
-#undef RT_ITER_UPDATE_KEY
+#undef RT_RT_ITER_SET_NODE_FROM
#undef RT_VERIFY_NODE
#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
index 98c78eb237..5c1034768e 100644
--- a/src/include/lib/radixtree_iter_impl.h
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -27,12 +27,10 @@
#error node level must be either inner or leaf
#endif
- bool found = false;
- uint8 key_chunk;
+ uint8 key_chunk = 0;
#ifdef RT_NODE_LEVEL_LEAF
- RT_VALUE_TYPE value;
-
+ Assert(value_p != NULL);
Assert(RT_NODE_IS_LEAF(node_iter->node));
#else
RT_PTR_LOCAL child = NULL;
@@ -50,99 +48,92 @@
{
RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
- node_iter->current_idx++;
- if (node_iter->current_idx >= n3->base.n.count)
- break;
+ if (node_iter->idx >= n3->base.n.count)
+ return false;
+
#ifdef RT_NODE_LEVEL_LEAF
- value = n3->values[node_iter->current_idx];
+ *value_p = n3->values[node_iter->idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->idx]);
#endif
- key_chunk = n3->base.chunks[node_iter->current_idx];
- found = true;
+ key_chunk = n3->base.chunks[node_iter->idx];
+ node_iter->idx++;
break;
}
case RT_NODE_KIND_32:
{
RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
- node_iter->current_idx++;
- if (node_iter->current_idx >= n32->base.n.count)
- break;
+ if (node_iter->idx >= n32->base.n.count)
+ return false;
#ifdef RT_NODE_LEVEL_LEAF
- value = n32->values[node_iter->current_idx];
+ *value_p = n32->values[node_iter->idx];
#else
- child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->idx]);
#endif
- key_chunk = n32->base.chunks[node_iter->current_idx];
- found = true;
+ key_chunk = n32->base.chunks[node_iter->idx];
+ node_iter->idx++;
break;
}
case RT_NODE_KIND_125:
{
RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
- int i;
+ int chunk;
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
{
- if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, chunk))
break;
}
- if (i >= RT_NODE_MAX_SLOTS)
- break;
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
- node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
#else
- child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, chunk));
#endif
- key_chunk = i;
- found = true;
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
break;
}
case RT_NODE_KIND_256:
{
RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
- int i;
+ int chunk;
- for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ for (chunk = node_iter->idx; chunk < RT_NODE_MAX_SLOTS; chunk++)
{
#ifdef RT_NODE_LEVEL_LEAF
- if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
#else
- if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
#endif
break;
}
- if (i >= RT_NODE_MAX_SLOTS)
- break;
+ if (chunk >= RT_NODE_MAX_SLOTS)
+ return false;
- node_iter->current_idx = i;
#ifdef RT_NODE_LEVEL_LEAF
- value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
#else
- child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, chunk));
#endif
- key_chunk = i;
- found = true;
+ key_chunk = chunk;
+ node_iter->idx = chunk + 1;
break;
}
}
- if (found)
- {
- RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
-#ifdef RT_NODE_LEVEL_LEAF
- *value_p = value;
-#endif
- }
+ /* Update the part of the key */
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << node_iter->node->shift);
+ iter->key |= (((uint64) key_chunk) << node_iter->node->shift);
#ifdef RT_NODE_LEVEL_LEAF
- return found;
+ return true;
#else
return child;
#endif
--
2.31.1
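As a quick caller-side reference for the reworked iteration above, here is a minimal sketch for a tree instantiated with RT_PREFIX = local_rt and RT_VALUE_TYPE = uint64, as tidstore.c does. The type names local_rt_radix_tree and local_rt_iter are assumptions based on the RT_PREFIX naming pattern; the tree must not be modified concurrently while the iterator is open.

static void
dump_all_entries(local_rt_radix_tree *tree)
{
	local_rt_iter *iter;		/* name assumed from the RT_PREFIX pattern */
	uint64		key;
	uint64		value;

	/* Begin iteration; keys come back in ascending order */
	iter = local_rt_begin_iterate(tree);

	while (local_rt_iterate_next(iter, &key, &value))
		elog(DEBUG1, "key " UINT64_FORMAT " value " UINT64_FORMAT,
			 key, value);

	/* Release the iterator (and whatever begin_iterate acquired) */
	local_rt_end_iterate(iter);
}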
Attachment: v32-0013-tidstore-Embed-output-offsets-in-TidStoreIterRes.patch (application/octet-stream)
From 453dc7fd8078ba202569417d4ed65ce6e7f4a850 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 18:00:44 +0900
Subject: [PATCH v32 13/18] tidstore: Embed output offsets in
TidStoreIterResult.
---
src/backend/access/common/tidstore.c | 7 ++-----
src/include/access/tidstore.h | 3 ++-
2 files changed, 4 insertions(+), 6 deletions(-)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index d9fe3d5f15..15b77b5bcb 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -470,12 +470,10 @@ TidStoreBeginIterate(TidStore *ts)
Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
- iter = palloc0(sizeof(TidStoreIter));
+ iter = palloc0(sizeof(TidStoreIter) +
+ sizeof(OffsetNumber) * ts->control->max_off);
iter->ts = ts;
- iter->result.blkno = InvalidBlockNumber;
- iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
-
if (TidStoreIsShared(ts))
iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
else
@@ -559,7 +557,6 @@ TidStoreEndIterate(TidStoreIter *iter)
else
local_rt_end_iterate(iter->tree_iter.local);
- pfree(iter->result.offsets);
pfree(iter);
}
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
index f0a432d0da..66f0fdd482 100644
--- a/src/include/access/tidstore.h
+++ b/src/include/access/tidstore.h
@@ -22,11 +22,12 @@ typedef dsa_pointer TidStoreHandle;
typedef struct TidStore TidStore;
typedef struct TidStoreIter TidStoreIter;
+/* Result struct for TidStoreIterateNext */
typedef struct TidStoreIterResult
{
BlockNumber blkno;
- OffsetNumber *offsets;
int num_offsets;
+ OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
} TidStoreIterResult;
extern TidStore *TidStoreCreate(size_t max_bytes, int max_off, dsa_area *dsa);
--
2.31.1
Attachment: v32-0008-radix-tree-remove-resolved-TODO.patch (application/octet-stream)
From 84bad553eecc97bbc3d7ccacc90723ae22b7888f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 17:29:32 +0900
Subject: [PATCH v32 08/18] radix tree: remove resolved TODO
---
src/include/lib/radixtree.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index c277d5a484..088d1dfd9d 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -612,7 +612,6 @@ static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
#endif
/* Contains the actual tree and ancillary info */
-// WIP: this name is a bit strange
typedef struct RT_RADIX_TREE_CONTROL
{
#ifdef RT_SHMEM
--
2.31.1
Attachment: v32-0007-radix-tree-rename-RT_EXTEND-and-RT_SET_EXTEND-to.patch (application/octet-stream)
From e25dc39fd502ae5c6c1c44a798a24dc5c6a1c7b0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 17:26:52 +0900
Subject: [PATCH v32 07/18] radix tree: rename RT_EXTEND and RT_SET_EXTEND to
RT_EXTEND_UP/DOWN
---
src/include/lib/radixtree.h | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index e546bd705c..c277d5a484 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -152,8 +152,8 @@
#define RT_INIT_NODE RT_MAKE_NAME(init_node)
#define RT_FREE_NODE RT_MAKE_NAME(free_node)
#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
-#define RT_EXTEND RT_MAKE_NAME(extend)
-#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_EXTEND_UP RT_MAKE_NAME(extend_up)
+#define RT_EXTEND_DOWN RT_MAKE_NAME(extend_down)
#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
@@ -1243,7 +1243,7 @@ RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
* it can store the key.
*/
static pg_noinline void
-RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+RT_EXTEND_UP(RT_RADIX_TREE *tree, uint64 key)
{
int target_shift;
RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
@@ -1282,7 +1282,7 @@ RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
* Insert inner and leaf nodes from 'node' to bottom.
*/
static pg_noinline void
-RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+RT_EXTEND_DOWN(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
{
int shift = node->shift;
@@ -1613,7 +1613,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
/* Extend the tree if necessary */
if (key > tree->ctl->max_val)
- RT_EXTEND(tree, key);
+ RT_EXTEND_UP(tree, key);
stored_child = tree->ctl->root;
parent = RT_PTR_GET_LOCAL(tree, stored_child);
@@ -1631,7 +1631,7 @@ RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
{
- RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_EXTEND_DOWN(tree, key, value_p, parent, stored_child, child);
RT_UNLOCK(tree);
return false;
}
@@ -2470,8 +2470,8 @@ RT_DUMP(RT_RADIX_TREE *tree)
#undef RT_INIT_NODE
#undef RT_FREE_NODE
#undef RT_FREE_RECURSE
-#undef RT_EXTEND
-#undef RT_SET_EXTEND
+#undef RT_EXTEND_UP
+#undef RT_EXTEND_DOWN
#undef RT_SWITCH_NODE_KIND
#undef RT_COPY_NODE
#undef RT_REPLACE_NODE
--
2.31.1
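For readers following the rename: RT_EXTEND_UP grows the tree upward by stacking new root nodes until the key fits under the root, while RT_EXTEND_DOWN fills in the chain of inner nodes below an existing node down to the leaf that will hold the value. Here is a rough sketch of the upward-growth condition only (a toy model that consumes 8 bits of key per level; max_val_for_shift and extend_up are illustrative names, not the template's actual code):

#include <stdint.h>

#define RT_SPAN 8               /* key bits consumed per tree level */

/* Highest key a tree whose root covers 'root_shift' bits can address. */
static uint64_t
max_val_for_shift(int root_shift)
{
    if (root_shift + RT_SPAN >= 64)
        return UINT64_MAX;
    return (UINT64_C(1) << (root_shift + RT_SPAN)) - 1;
}

/* Grow the toy tree upward until 'key' fits under its root. */
static int
extend_up(int root_shift, uint64_t key)
{
    while (key > max_val_for_shift(root_shift))
        root_shift += RT_SPAN;  /* conceptually: push a new root on top */
    return root_shift;
}

int
main(void)
{
    /* a tree covering one level (keys 0..255) must grow once for key 300 */
    return extend_up(0, 300) == RT_SPAN ? 0 : 1;
}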
Attachment: v32-0006-Use-TIDStore-for-storing-dead-tuple-TID-during-l.patch (application/octet-stream)
From 1f6c4aa27d734b8c81369541481b0d3abd0d5dec Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 17 Apr 2023 17:22:03 +0900
Subject: [PATCH v32 06/18] Use TIDStore for storing dead tuple TID during lazy
vacuum
Previously, we used an array of ItemPointerData to store dead tuple
TIDs, which was not space-efficient and was slow to look up. Also, it
had a 1GB limit on its size.
Now we use TIDStore to store dead tuple TIDs. Since the TIDStore,
backed by the radix tree, incrementally allocates memory, we get rid
of the 1GB limit.
Since we are no longer able to exactly estimate the maximum number of
TIDs that can be stored, pg_stat_progress_vacuum now shows the
progress information based on the amount of memory in bytes. The
column names are also changed to max_dead_tuple_bytes and
num_dead_tuple_bytes.
In addition, since the TIDStore uses the radix tree internally, the
minimum amount of memory required by TIDStore is 1MB, the initial DSA
segment size. Due to that, we increase the minimum value of
maintenance_work_mem (and autovacuum_work_mem) from 1MB to 2MB.
XXX: needs to bump catalog version
---
doc/src/sgml/monitoring.sgml | 8 +-
src/backend/access/heap/vacuumlazy.c | 278 ++++++++-------------
src/backend/catalog/system_views.sql | 2 +-
src/backend/commands/vacuum.c | 78 +-----
src/backend/commands/vacuumparallel.c | 73 +++---
src/backend/postmaster/autovacuum.c | 6 +-
src/backend/storage/lmgr/lwlock.c | 2 +
src/backend/utils/misc/guc_tables.c | 2 +-
src/include/commands/progress.h | 4 +-
src/include/commands/vacuum.h | 25 +-
src/include/storage/lwlock.h | 1 +
src/test/regress/expected/cluster.out | 2 +-
src/test/regress/expected/create_index.out | 2 +-
src/test/regress/expected/rules.out | 4 +-
src/test/regress/sql/cluster.sql | 2 +-
src/test/regress/sql/create_index.sql | 2 +-
16 files changed, 177 insertions(+), 314 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index be4448fe6e..9b64614beb 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7320,10 +7320,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>max_dead_tuples</structfield> <type>bigint</type>
+ <structfield>max_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples that we can store before needing to perform
+ Amount of dead tuple data that we can store before needing to perform
an index vacuum cycle, based on
<xref linkend="guc-maintenance-work-mem"/>.
</para></entry>
@@ -7331,10 +7331,10 @@ FROM pg_stat_get_backend_idset() AS backendid;
<row>
<entry role="catalog_table_entry"><para role="column_definition">
- <structfield>num_dead_tuples</structfield> <type>bigint</type>
+ <structfield>num_dead_tuple_bytes</structfield> <type>bigint</type>
</para>
<para>
- Number of dead tuples collected since the last index vacuum cycle.
+ Amount of dead tuple data collected since the last index vacuum cycle.
</para></entry>
</row>
</tbody>
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0a9ebd22bd..2c72088e69 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3,18 +3,18 @@
* vacuumlazy.c
* Concurrent ("lazy") vacuuming.
*
- * The major space usage for vacuuming is storage for the array of dead TIDs
+ * The major space usage for vacuuming is TidStore, a storage for dead TIDs
* that are to be removed from indexes. We want to ensure we can vacuum even
* the very largest relations with finite memory space usage. To do that, we
- * set upper bounds on the number of TIDs we can keep track of at once.
+ * set upper bounds on the maximum memory that can be used for keeping track
+ * of dead TIDs at once.
*
* We are willing to use at most maintenance_work_mem (or perhaps
* autovacuum_work_mem) memory space to keep track of dead TIDs. We initially
- * allocate an array of TIDs of that size, with an upper limit that depends on
- * table size (this limit ensures we don't allocate a huge area uselessly for
- * vacuuming small tables). If the array threatens to overflow, we must call
- * lazy_vacuum to vacuum indexes (and to vacuum the pages that we've pruned).
- * This frees up the memory space dedicated to storing dead TIDs.
+ * create a TidStore with the maximum bytes that can be used by the TidStore.
+ * If the TidStore is full, we must call lazy_vacuum to vacuum indexes (and to
+ * vacuum the pages that we've pruned). This frees up the memory space dedicated
+ * to storing dead TIDs.
*
* In practice VACUUM will often complete its initial pass over the target
* heap relation without ever running out of space to store TIDs. This means
@@ -40,6 +40,7 @@
#include "access/heapam_xlog.h"
#include "access/htup_details.h"
#include "access/multixact.h"
+#include "access/tidstore.h"
#include "access/transam.h"
#include "access/visibilitymap.h"
#include "access/xact.h"
@@ -186,7 +187,7 @@ typedef struct LVRelState
* lazy_vacuum_heap_rel, which marks the same LP_DEAD line pointers as
* LP_UNUSED during second heap pass.
*/
- VacDeadItems *dead_items; /* TIDs whose index tuples we'll delete */
+ TidStore *dead_items; /* TIDs whose index tuples we'll delete */
BlockNumber rel_pages; /* total number of pages */
BlockNumber scanned_pages; /* # pages examined (not skipped via VM) */
BlockNumber removed_pages; /* # pages removed by relation truncation */
@@ -218,11 +219,14 @@ typedef struct LVRelState
typedef struct LVPagePruneState
{
bool hastup; /* Page prevents rel truncation? */
- bool has_lpdead_items; /* includes existing LP_DEAD items */
+
+ /* collected offsets of LP_DEAD items including existing ones */
+ OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
+ int num_offsets;
/*
* State describes the proper VM bit states to set for the page following
- * pruning and freezing. all_visible implies !has_lpdead_items, but don't
+ * pruning and freezing. all_visible implies num_offsets == 0, but don't
* trust all_frozen result unless all_visible is also set to true.
*/
bool all_visible; /* Every item visible to all? */
@@ -257,8 +261,9 @@ static bool lazy_scan_noprune(LVRelState *vacrel, Buffer buf,
static void lazy_vacuum(LVRelState *vacrel);
static bool lazy_vacuum_all_indexes(LVRelState *vacrel);
static void lazy_vacuum_heap_rel(LVRelState *vacrel);
-static int lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
- Buffer buffer, int index, Buffer vmbuffer);
+static void lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *offsets, int num_offsets,
+ Buffer buffer, Buffer vmbuffer);
static bool lazy_check_wraparound_failsafe(LVRelState *vacrel);
static void lazy_cleanup_all_indexes(LVRelState *vacrel);
static IndexBulkDeleteResult *lazy_vacuum_one_index(Relation indrel,
@@ -485,11 +490,11 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
}
/*
- * Allocate dead_items array memory using dead_items_alloc. This handles
- * parallel VACUUM initialization as part of allocating shared memory
- * space used for dead_items. (But do a failsafe precheck first, to
- * ensure that parallel VACUUM won't be attempted at all when relfrozenxid
- * is already dangerously old.)
+ * Allocate dead_items memory using dead_items_alloc. This handles parallel
+ * VACUUM initialization as part of allocating shared memory space used for
+ * dead_items. (But do a failsafe precheck first, to ensure that parallel
+ * VACUUM won't be attempted at all when relfrozenxid is already dangerously
+ * old.)
*/
lazy_check_wraparound_failsafe(vacrel);
dead_items_alloc(vacrel, params->nworkers);
@@ -795,7 +800,7 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
* have collected the TIDs whose index tuples need to be removed.
*
* Finally, invokes lazy_vacuum_heap_rel to vacuum heap pages, which
- * largely consists of marking LP_DEAD items (from collected TID array)
+ * largely consists of marking LP_DEAD items (from vacrel->dead_items)
* as LP_UNUSED. This has to happen in a second, final pass over the
* heap, to preserve a basic invariant that all index AMs rely on: no
* extant index tuple can ever be allowed to contain a TID that points to
@@ -823,21 +828,21 @@ lazy_scan_heap(LVRelState *vacrel)
blkno,
next_unskippable_block,
next_fsm_block_to_vacuum = 0;
- VacDeadItems *dead_items = vacrel->dead_items;
+ TidStore *dead_items = vacrel->dead_items;
Buffer vmbuffer = InvalidBuffer;
bool next_unskippable_allvis,
skipping_current_range;
const int initprog_index[] = {
PROGRESS_VACUUM_PHASE,
PROGRESS_VACUUM_TOTAL_HEAP_BLKS,
- PROGRESS_VACUUM_MAX_DEAD_TUPLES
+ PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES
};
int64 initprog_val[3];
/* Report that we're scanning the heap, advertising total # of blocks */
initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
initprog_val[1] = rel_pages;
- initprog_val[2] = dead_items->max_items;
+ initprog_val[2] = tidstore_max_memory(vacrel->dead_items);
pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
/* Set up an initial range of skippable blocks using the visibility map */
@@ -904,8 +909,7 @@ lazy_scan_heap(LVRelState *vacrel)
* dead_items TIDs, pause and do a cycle of vacuuming before we tackle
* this page.
*/
- Assert(dead_items->max_items >= MaxHeapTuplesPerPage);
- if (dead_items->max_items - dead_items->num_items < MaxHeapTuplesPerPage)
+ if (tidstore_is_full(vacrel->dead_items))
{
/*
* Before beginning index vacuuming, we release any pin we may
@@ -967,7 +971,7 @@ lazy_scan_heap(LVRelState *vacrel)
continue;
}
- /* Collect LP_DEAD items in dead_items array, count tuples */
+ /* Collect LP_DEAD items in dead_items, count tuples */
if (lazy_scan_noprune(vacrel, buf, blkno, page, &hastup,
&recordfreespace))
{
@@ -1009,14 +1013,14 @@ lazy_scan_heap(LVRelState *vacrel)
* Prune, freeze, and count tuples.
*
* Accumulates details of remaining LP_DEAD line pointers on page in
- * dead_items array. This includes LP_DEAD line pointers that we
- * pruned ourselves, as well as existing LP_DEAD line pointers that
- * were pruned some time earlier. Also considers freezing XIDs in the
- * tuple headers of remaining items with storage.
+ * dead_items. This includes LP_DEAD line pointers that we pruned
+ * ourselves, as well as existing LP_DEAD line pointers that were pruned
+ * some time earlier. Also considers freezing XIDs in the tuple headers
+ * of remaining items with storage.
*/
lazy_scan_prune(vacrel, buf, blkno, page, &prunestate);
- Assert(!prunestate.all_visible || !prunestate.has_lpdead_items);
+ Assert(!prunestate.all_visible || (prunestate.num_offsets == 0));
/* Remember the location of the last page with nonremovable tuples */
if (prunestate.hastup)
@@ -1032,14 +1036,12 @@ lazy_scan_heap(LVRelState *vacrel)
* performed here can be thought of as the one-pass equivalent of
* a call to lazy_vacuum().
*/
- if (prunestate.has_lpdead_items)
+ if (prunestate.num_offsets > 0)
{
Size freespace;
- lazy_vacuum_heap_page(vacrel, blkno, buf, 0, vmbuffer);
-
- /* Forget the LP_DEAD items that we just vacuumed */
- dead_items->num_items = 0;
+ lazy_vacuum_heap_page(vacrel, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets, buf, vmbuffer);
/*
* Periodically perform FSM vacuuming to make newly-freed
@@ -1076,7 +1078,16 @@ lazy_scan_heap(LVRelState *vacrel)
* with prunestate-driven visibility map and FSM steps (just like
* the two-pass strategy).
*/
- Assert(dead_items->num_items == 0);
+ Assert(tidstore_num_tids(dead_items) == 0);
+ }
+ else if (prunestate.num_offsets > 0)
+ {
+ /* Save details of the LP_DEAD items from the page in dead_items */
+ tidstore_add_tids(dead_items, blkno, prunestate.deadoffsets,
+ prunestate.num_offsets);
+
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
}
/*
@@ -1143,7 +1154,7 @@ lazy_scan_heap(LVRelState *vacrel)
* There should never be LP_DEAD items on a page with PD_ALL_VISIBLE
* set, however.
*/
- else if (prunestate.has_lpdead_items && PageIsAllVisible(page))
+ else if ((prunestate.num_offsets > 0) && PageIsAllVisible(page))
{
elog(WARNING, "page containing LP_DEAD items is marked as all-visible in relation \"%s\" page %u",
vacrel->relname, blkno);
@@ -1191,7 +1202,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Final steps for block: drop cleanup lock, record free space in the
* FSM
*/
- if (prunestate.has_lpdead_items && vacrel->do_index_vacuuming)
+ if ((prunestate.num_offsets > 0) && vacrel->do_index_vacuuming)
{
/*
* Wait until lazy_vacuum_heap_rel() to save free space. This
@@ -1247,7 +1258,7 @@ lazy_scan_heap(LVRelState *vacrel)
* Do index vacuuming (call each index's ambulkdelete routine), then do
* related heap vacuuming
*/
- if (dead_items->num_items > 0)
+ if (tidstore_num_tids(dead_items) > 0)
lazy_vacuum(vacrel);
/*
@@ -1522,9 +1533,9 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
* The approach we take now is to restart pruning when the race condition is
* detected. This allows heap_page_prune() to prune the tuples inserted by
* the now-aborted transaction. This is a little crude, but it guarantees
- * that any items that make it into the dead_items array are simple LP_DEAD
- * line pointers, and that every remaining item with tuple storage is
- * considered as a candidate for freezing.
+ * that any items that make it into the dead_items are simple LP_DEAD line
+ * pointers, and that every remaining item with tuple storage is considered
+ * as a candidate for freezing.
*/
static void
lazy_scan_prune(LVRelState *vacrel,
@@ -1541,13 +1552,11 @@ lazy_scan_prune(LVRelState *vacrel,
HTSV_Result res;
int tuples_deleted,
tuples_frozen,
- lpdead_items,
live_tuples,
recently_dead_tuples;
int nnewlpdead;
HeapPageFreeze pagefrz;
int64 fpi_before = pgWalUsage.wal_fpi;
- OffsetNumber deadoffsets[MaxHeapTuplesPerPage];
HeapTupleFreeze frozen[MaxHeapTuplesPerPage];
Assert(BufferGetBlockNumber(buf) == blkno);
@@ -1569,7 +1578,6 @@ retry:
pagefrz.NoFreezePageRelminMxid = vacrel->NewRelminMxid;
tuples_deleted = 0;
tuples_frozen = 0;
- lpdead_items = 0;
live_tuples = 0;
recently_dead_tuples = 0;
@@ -1578,9 +1586,9 @@ retry:
*
* We count tuples removed by the pruning step as tuples_deleted. Its
* final value can be thought of as the number of tuples that have been
- * deleted from the table. It should not be confused with lpdead_items;
- * lpdead_items's final value can be thought of as the number of tuples
- * that were deleted from indexes.
+ * deleted from the table. It should not be confused with
+ * prunestate->deadoffsets; prunestate->deadoffsets's final value can
+ * be thought of as the number of tuples that were deleted from indexes.
*/
tuples_deleted = heap_page_prune(rel, buf, vacrel->vistest,
InvalidTransactionId, 0, &nnewlpdead,
@@ -1591,7 +1599,7 @@ retry:
* requiring freezing among remaining tuples with storage
*/
prunestate->hastup = false;
- prunestate->has_lpdead_items = false;
+ prunestate->num_offsets = 0;
prunestate->all_visible = true;
prunestate->all_frozen = true;
prunestate->visibility_cutoff_xid = InvalidTransactionId;
@@ -1636,7 +1644,7 @@ retry:
* (This is another case where it's useful to anticipate that any
* LP_DEAD items will become LP_UNUSED during the ongoing VACUUM.)
*/
- deadoffsets[lpdead_items++] = offnum;
+ prunestate->deadoffsets[prunestate->num_offsets++] = offnum;
continue;
}
@@ -1873,7 +1881,7 @@ retry:
*/
#ifdef USE_ASSERT_CHECKING
/* Note that all_frozen value does not matter when !all_visible */
- if (prunestate->all_visible && lpdead_items == 0)
+ if (prunestate->all_visible && prunestate->num_offsets == 0)
{
TransactionId cutoff;
bool all_frozen;
@@ -1886,28 +1894,9 @@ retry:
}
#endif
- /*
- * Now save details of the LP_DEAD items from the page in vacrel
- */
- if (lpdead_items > 0)
+ if (prunestate->num_offsets > 0)
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
-
vacrel->lpdead_item_pages++;
- prunestate->has_lpdead_items = true;
-
- ItemPointerSetBlockNumber(&tmp, blkno);
-
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
/*
* It was convenient to ignore LP_DEAD items in all_visible earlier on
@@ -1926,7 +1915,7 @@ retry:
/* Finally, add page-local counts to whole-VACUUM counts */
vacrel->tuples_deleted += tuples_deleted;
vacrel->tuples_frozen += tuples_frozen;
- vacrel->lpdead_items += lpdead_items;
+ vacrel->lpdead_items += prunestate->num_offsets;
vacrel->live_tuples += live_tuples;
vacrel->recently_dead_tuples += recently_dead_tuples;
}
@@ -1938,7 +1927,7 @@ retry:
* lazy_scan_prune, which requires a full cleanup lock. While pruning isn't
* performed here, it's quite possible that an earlier opportunistic pruning
* operation left LP_DEAD items behind. We'll at least collect any such items
- * in the dead_items array for removal from indexes.
+ * in the dead_items for removal from indexes.
*
* For aggressive VACUUM callers, we may return false to indicate that a full
* cleanup lock is required for processing by lazy_scan_prune. This is only
@@ -2097,7 +2086,7 @@ lazy_scan_noprune(LVRelState *vacrel,
vacrel->NewRelfrozenXid = NoFreezePageRelfrozenXid;
vacrel->NewRelminMxid = NoFreezePageRelminMxid;
- /* Save any LP_DEAD items found on the page in dead_items array */
+ /* Save any LP_DEAD items found on the page in dead_items */
if (vacrel->nindexes == 0)
{
/* Using one-pass strategy (since table has no indexes) */
@@ -2127,8 +2116,7 @@ lazy_scan_noprune(LVRelState *vacrel,
}
else
{
- VacDeadItems *dead_items = vacrel->dead_items;
- ItemPointerData tmp;
+ TidStore *dead_items = vacrel->dead_items;
/*
* Page has LP_DEAD items, and so any references/TIDs that remain in
@@ -2137,17 +2125,10 @@ lazy_scan_noprune(LVRelState *vacrel,
*/
vacrel->lpdead_item_pages++;
- ItemPointerSetBlockNumber(&tmp, blkno);
+ tidstore_add_tids(dead_items, blkno, deadoffsets, lpdead_items);
- for (int i = 0; i < lpdead_items; i++)
- {
- ItemPointerSetOffsetNumber(&tmp, deadoffsets[i]);
- dead_items->items[dead_items->num_items++] = tmp;
- }
-
- Assert(dead_items->num_items <= dead_items->max_items);
- pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
- dead_items->num_items);
+ pgstat_progress_update_param(PROGRESS_VACUUM_DEAD_TUPLE_BYTES,
+ tidstore_memory_usage(dead_items));
vacrel->lpdead_items += lpdead_items;
@@ -2196,7 +2177,7 @@ lazy_vacuum(LVRelState *vacrel)
if (!vacrel->do_index_vacuuming)
{
Assert(!vacrel->do_index_cleanup);
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
return;
}
@@ -2225,7 +2206,7 @@ lazy_vacuum(LVRelState *vacrel)
BlockNumber threshold;
Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == tidstore_num_tids(vacrel->dead_items));
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2252,8 +2233,8 @@ lazy_vacuum(LVRelState *vacrel)
* cases then this may need to be reconsidered.
*/
threshold = (double) vacrel->rel_pages * BYPASS_THRESHOLD_PAGES;
- bypass = (vacrel->lpdead_item_pages < threshold &&
- vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L));
+ bypass = (vacrel->lpdead_item_pages < threshold) &&
+ tidstore_memory_usage(vacrel->dead_items) < (32L * 1024L * 1024L);
}
if (bypass)
@@ -2298,7 +2279,7 @@ lazy_vacuum(LVRelState *vacrel)
* Forget the LP_DEAD items that we just vacuumed (or just decided to not
* vacuum)
*/
- vacrel->dead_items->num_items = 0;
+ tidstore_reset(vacrel->dead_items);
}
/*
@@ -2371,7 +2352,7 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
* place).
*/
Assert(vacrel->num_index_scans > 0 ||
- vacrel->dead_items->num_items == vacrel->lpdead_items);
+ tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items);
Assert(allindexes || VacuumFailsafeActive);
/*
@@ -2390,9 +2371,8 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
/*
* lazy_vacuum_heap_rel() -- second pass over the heap for two pass strategy
*
- * This routine marks LP_DEAD items in vacrel->dead_items array as LP_UNUSED.
- * Pages that never had lazy_scan_prune record LP_DEAD items are not visited
- * at all.
+ * This routine marks LP_DEAD items in vacrel->dead_items as LP_UNUSED. Pages
+ * that never had lazy_scan_prune record LP_DEAD items are not visited at all.
*
* We may also be able to truncate the line pointer array of the heap pages we
* visit. If there is a contiguous group of LP_UNUSED items at the end of the
@@ -2408,10 +2388,11 @@ lazy_vacuum_all_indexes(LVRelState *vacrel)
static void
lazy_vacuum_heap_rel(LVRelState *vacrel)
{
- int index = 0;
BlockNumber vacuumed_pages = 0;
Buffer vmbuffer = InvalidBuffer;
LVSavedErrInfo saved_err_info;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
Assert(vacrel->do_index_vacuuming);
Assert(vacrel->do_index_cleanup);
@@ -2426,7 +2407,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
VACUUM_ERRCB_PHASE_VACUUM_HEAP,
InvalidBlockNumber, InvalidOffsetNumber);
- while (index < vacrel->dead_items->num_items)
+ iter = tidstore_begin_iterate(vacrel->dead_items);
+ while ((result = tidstore_iterate_next(iter)) != NULL)
{
BlockNumber blkno;
Buffer buf;
@@ -2435,7 +2417,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
vacuum_delay_point();
- blkno = ItemPointerGetBlockNumber(&vacrel->dead_items->items[index]);
+ blkno = result->blkno;
vacrel->blkno = blkno;
/*
@@ -2449,7 +2431,8 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
buf = ReadBufferExtended(vacrel->rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
vacrel->bstrategy);
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
- index = lazy_vacuum_heap_page(vacrel, blkno, buf, index, vmbuffer);
+ lazy_vacuum_heap_page(vacrel, blkno, result->offsets, result->num_offsets,
+ buf, vmbuffer);
/* Now that we've vacuumed the page, record its available space */
page = BufferGetPage(buf);
@@ -2459,6 +2442,7 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
RecordPageWithFreeSpace(vacrel->rel, blkno, freespace);
vacuumed_pages++;
}
+ tidstore_end_iterate(iter);
vacrel->blkno = InvalidBlockNumber;
if (BufferIsValid(vmbuffer))
@@ -2468,36 +2452,31 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
* We set all LP_DEAD items from the first heap pass to LP_UNUSED during
* the second heap pass. No more, no less.
*/
- Assert(index > 0);
Assert(vacrel->num_index_scans > 1 ||
- (index == vacrel->lpdead_items &&
+ (tidstore_num_tids(vacrel->dead_items) == vacrel->lpdead_items &&
vacuumed_pages == vacrel->lpdead_item_pages));
ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " UINT64_FORMAT "dead item identifiers in %u pages",
+ vacrel->relname, tidstore_num_tids(vacrel->dead_items),
+ vacuumed_pages)));
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
}
/*
- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- * vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
*
* Caller must have an exclusive buffer lock on the buffer (though a full
* cleanup lock is also acceptable). vmbuffer must be valid and already have
* a pin on blkno's visibility map page.
- *
- * index is an offset into the vacrel->dead_items array for the first listed
- * LP_DEAD item on the page. The return value is the first index immediately
- * after all LP_DEAD items for the same page in the array.
*/
-static int
-lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
- int index, Buffer vmbuffer)
+static void
+lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno,
+ OffsetNumber *deadoffsets, int num_offsets, Buffer buffer,
+ Buffer vmbuffer)
{
- VacDeadItems *dead_items = vacrel->dead_items;
Page page = BufferGetPage(buffer);
OffsetNumber unused[MaxHeapTuplesPerPage];
int nunused = 0;
@@ -2516,16 +2495,11 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
START_CRIT_SECTION();
- for (; index < dead_items->num_items; index++)
+ for (int i = 0; i < num_offsets; i++)
{
- BlockNumber tblk;
- OffsetNumber toff;
ItemId itemid;
+ OffsetNumber toff = deadoffsets[i];
- tblk = ItemPointerGetBlockNumber(&dead_items->items[index]);
- if (tblk != blkno)
- break; /* past end of tuples for this block */
- toff = ItemPointerGetOffsetNumber(&dead_items->items[index]);
itemid = PageGetItemId(page, toff);
Assert(ItemIdIsDead(itemid) && !ItemIdHasStorage(itemid));
@@ -2595,7 +2569,6 @@ lazy_vacuum_heap_page(LVRelState *vacrel, BlockNumber blkno, Buffer buffer,
/* Revert to the previous phase information for error traceback */
restore_vacuum_error_info(vacrel, &saved_err_info);
- return index;
}
/*
@@ -2692,8 +2665,8 @@ lazy_cleanup_all_indexes(LVRelState *vacrel)
* lazy_vacuum_one_index() -- vacuum index relation.
*
* Delete all the index tuples containing a TID collected in
- * vacrel->dead_items array. Also update running statistics.
- * Exact details depend on index AM's ambulkdelete routine.
+ * vacrel->dead_items. Also update running statistics. Exact
+ * details depend on index AM's ambulkdelete routine.
*
* reltuples is the number of heap tuples to be passed to the
* bulkdelete callback. It's always assumed to be estimated.
@@ -3101,48 +3074,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
}
/*
- * Returns the number of dead TIDs that VACUUM should allocate space to
- * store, given a heap rel of size vacrel->rel_pages, and given current
- * maintenance_work_mem setting (or current autovacuum_work_mem setting,
- * when applicable).
- *
- * See the comments at the head of this file for rationale.
- */
-static int
-dead_items_max_items(LVRelState *vacrel)
-{
- int64 max_items;
- int vac_work_mem = IsAutoVacuumWorkerProcess() &&
- autovacuum_work_mem != -1 ?
- autovacuum_work_mem : maintenance_work_mem;
-
- if (vacrel->nindexes > 0)
- {
- BlockNumber rel_pages = vacrel->rel_pages;
-
- max_items = MAXDEADITEMS(vac_work_mem * 1024L);
- max_items = Min(max_items, INT_MAX);
- max_items = Min(max_items, MAXDEADITEMS(MaxAllocSize));
-
- /* curious coding here to ensure the multiplication can't overflow */
- if ((BlockNumber) (max_items / MaxHeapTuplesPerPage) > rel_pages)
- max_items = rel_pages * MaxHeapTuplesPerPage;
-
- /* stay sane if small maintenance_work_mem */
- max_items = Max(max_items, MaxHeapTuplesPerPage);
- }
- else
- {
- /* One-pass case only stores a single heap page's TIDs at a time */
- max_items = MaxHeapTuplesPerPage;
- }
-
- return (int) max_items;
-}
-
-/*
- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
@@ -3150,11 +3083,9 @@ dead_items_max_items(LVRelState *vacrel)
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
- VacDeadItems *dead_items;
- int max_items;
-
- max_items = dead_items_max_items(vacrel);
- Assert(max_items >= MaxHeapTuplesPerPage);
+ int vac_work_mem = IsAutoVacuumWorkerProcess() &&
+ autovacuum_work_mem != -1 ?
+ autovacuum_work_mem * 1024L : maintenance_work_mem * 1024L;
/*
* Initialize state for a parallel vacuum. As of now, only one worker can
@@ -3181,7 +3112,7 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
else
vacrel->pvs = parallel_vacuum_init(vacrel->rel, vacrel->indrels,
vacrel->nindexes, nworkers,
- max_items,
+ vac_work_mem, MaxHeapTuplesPerPage,
vacrel->verbose ? INFO : DEBUG2,
vacrel->bstrategy);
@@ -3194,11 +3125,8 @@ dead_items_alloc(LVRelState *vacrel, int nworkers)
}
/* Serial VACUUM case */
- dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items));
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
-
- vacrel->dead_items = dead_items;
+ vacrel->dead_items = tidstore_create(vac_work_mem, MaxHeapTuplesPerPage,
+ NULL);
}
/*
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2129c916aa..134df925ce 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1190,7 +1190,7 @@ CREATE VIEW pg_stat_progress_vacuum AS
END AS phase,
S.param2 AS heap_blks_total, S.param3 AS heap_blks_scanned,
S.param4 AS heap_blks_vacuumed, S.param5 AS index_vacuum_count,
- S.param6 AS max_dead_tuples, S.param7 AS num_dead_tuples
+ S.param6 AS max_dead_tuple_bytes, S.param7 AS dead_tuple_bytes
FROM pg_stat_get_progress_info('VACUUM') AS S
LEFT JOIN pg_database D ON S.datid = D.oid;
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index a843f9ad92..f3922b72dc 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -119,7 +119,6 @@ static bool vacuum_rel(Oid relid, RangeVar *relation, VacuumParams *params,
static double compute_parallel_delay(void);
static VacOptValue get_vacoptval_from_boolean(DefElem *def);
static bool vac_tid_reaped(ItemPointer itemptr, void *state);
-static int vac_cmp_itemptr(const void *left, const void *right);
/*
* GUC check function to ensure GUC value specified is within the allowable
@@ -2478,16 +2477,16 @@ get_vacoptval_from_boolean(DefElem *def)
*/
IndexBulkDeleteResult *
vac_bulkdel_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items)
+ TidStore *dead_items)
{
/* Do bulk deletion */
istat = index_bulk_delete(ivinfo, istat, vac_tid_reaped,
(void *) dead_items);
ereport(ivinfo->message_level,
- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",
RelationGetRelationName(ivinfo->index),
- dead_items->num_items)));
+ tidstore_num_tids(dead_items))));
return istat;
}
@@ -2518,82 +2517,15 @@ vac_cleanup_one_index(IndexVacuumInfo *ivinfo, IndexBulkDeleteResult *istat)
return istat;
}
-/*
- * Returns the total required space for VACUUM's dead_items array given a
- * max_items value.
- */
-Size
-vac_max_items_to_alloc_size(int max_items)
-{
- Assert(max_items <= MAXDEADITEMS(MaxAllocSize));
-
- return offsetof(VacDeadItems, items) + sizeof(ItemPointerData) * max_items;
-}
-
/*
* vac_tid_reaped() -- is a particular tid deletable?
*
* This has the right signature to be an IndexBulkDeleteCallback.
- *
- * Assumes dead_items array is sorted (in ascending TID order).
*/
static bool
vac_tid_reaped(ItemPointer itemptr, void *state)
{
- VacDeadItems *dead_items = (VacDeadItems *) state;
- int64 litem,
- ritem,
- item;
- ItemPointer res;
-
- litem = itemptr_encode(&dead_items->items[0]);
- ritem = itemptr_encode(&dead_items->items[dead_items->num_items - 1]);
- item = itemptr_encode(itemptr);
-
- /*
- * Doing a simple bound check before bsearch() is useful to avoid the
- * extra cost of bsearch(), especially if dead items on the heap are
- * concentrated in a certain range. Since this function is called for
- * every index tuple, it pays to be really fast.
- */
- if (item < litem || item > ritem)
- return false;
-
- res = (ItemPointer) bsearch(itemptr,
- dead_items->items,
- dead_items->num_items,
- sizeof(ItemPointerData),
- vac_cmp_itemptr);
-
- return (res != NULL);
-}
-
-/*
- * Comparator routines for use with qsort() and bsearch().
- */
-static int
-vac_cmp_itemptr(const void *left, const void *right)
-{
- BlockNumber lblk,
- rblk;
- OffsetNumber loff,
- roff;
-
- lblk = ItemPointerGetBlockNumber((ItemPointer) left);
- rblk = ItemPointerGetBlockNumber((ItemPointer) right);
-
- if (lblk < rblk)
- return -1;
- if (lblk > rblk)
- return 1;
-
- loff = ItemPointerGetOffsetNumber((ItemPointer) left);
- roff = ItemPointerGetOffsetNumber((ItemPointer) right);
-
- if (loff < roff)
- return -1;
- if (loff > roff)
- return 1;
+ TidStore *dead_items = (TidStore *) state;
- return 0;
+ return tidstore_lookup_tid(dead_items, itemptr);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 87ea5c5242..c363f45e32 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -9,12 +9,11 @@
* In a parallel vacuum, we perform both index bulk deletion and index cleanup
* with parallel worker processes. Individual indexes are processed by one
* vacuum process. ParalleVacuumState contains shared information as well as
- * the memory space for storing dead items allocated in the DSM segment. We
- * launch parallel worker processes at the start of parallel index
- * bulk-deletion and index cleanup and once all indexes are processed, the
- * parallel worker processes exit. Each time we process indexes in parallel,
- * the parallel context is re-initialized so that the same DSM can be used for
- * multiple passes of index bulk-deletion and index cleanup.
+ * the shared TidStore. We launch parallel worker processes at the start of
+ * parallel index bulk-deletion and index cleanup and once all indexes are
+ * processed, the parallel worker processes exit. Each time we process indexes
+ * in parallel, the parallel context is re-initialized so that the same DSM can
+ * be used for multiple passes of index bulk-deletion and index cleanup.
*
* Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -109,6 +108,9 @@ typedef struct PVShared
/* Counter for vacuuming and cleanup */
pg_atomic_uint32 idx;
+
+ /* Handle of the shared TidStore */
+ tidstore_handle dead_items_handle;
} PVShared;
/* Status used during parallel index vacuum or cleanup */
@@ -175,7 +177,8 @@ struct ParallelVacuumState
PVIndStats *indstats;
/* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ dsa_area *dead_items_area;
/* Points to buffer usage area in DSM */
BufferUsage *buffer_usage;
@@ -231,20 +234,23 @@ static void parallel_vacuum_error_callback(void *arg);
*/
ParallelVacuumState *
parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)
{
ParallelVacuumState *pvs;
ParallelContext *pcxt;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
PVIndStats *indstats;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ void *area_space;
+ dsa_area *dead_items_dsa;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
- Size est_dead_items_len;
+ Size dsa_minsize = dsa_minimum_size();
int nindexes_mwm = 0;
int parallel_workers = 0;
int querylen;
@@ -293,9 +299,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_chunk(&pcxt->estimator, est_shared_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
- /* Estimate size for dead_items -- PARALLEL_VACUUM_KEY_DEAD_ITEMS */
- est_dead_items_len = vac_max_items_to_alloc_size(max_items);
- shm_toc_estimate_chunk(&pcxt->estimator, est_dead_items_len);
+ /* Estimate size for dead tuple DSA -- PARALLEL_VACUUM_KEY_DSA */
+ shm_toc_estimate_chunk(&pcxt->estimator, dsa_minsize);
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
@@ -361,6 +366,16 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INDEX_STATS, indstats);
pvs->indstats = indstats;
+ /* Prepare DSA space for dead items */
+ area_space = shm_toc_allocate(pcxt->toc, dsa_minsize);
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, area_space);
+ dead_items_dsa = dsa_create_in_place(area_space, dsa_minsize,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
+ pcxt->seg);
+ dead_items = tidstore_create(vac_work_mem, max_offset, dead_items_dsa);
+ pvs->dead_items = dead_items;
+ pvs->dead_items_area = dead_items_dsa;
+
/* Prepare shared information */
shared = (PVShared *) shm_toc_allocate(pcxt->toc, est_shared_len);
MemSet(shared, 0, est_shared_len);
@@ -370,6 +385,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
(nindexes_mwm > 0) ?
maintenance_work_mem / Min(parallel_workers, nindexes_mwm) :
maintenance_work_mem;
+ shared->dead_items_handle = tidstore_get_handle(dead_items);
/* Use the same buffer size for all workers */
shared->ring_nbuffers = GetAccessStrategyBufferCount(bstrategy);
@@ -381,15 +397,6 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_SHARED, shared);
pvs->shared = shared;
- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
- est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;
-
/*
* Allocate space for each worker's BufferUsage and WalUsage; no need to
* initialize
@@ -447,6 +454,9 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
istats[i] = NULL;
}
+ tidstore_destroy(pvs->dead_items);
+ dsa_detach(pvs->dead_items_area);
+
DestroyParallelContext(pvs->pcxt);
ExitParallelMode();
@@ -455,7 +465,7 @@ parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats)
}
/* Returns the dead items space */
-VacDeadItems *
+TidStore *
parallel_vacuum_get_dead_items(ParallelVacuumState *pvs)
{
return pvs->dead_items;
@@ -954,7 +964,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
Relation *indrels;
PVIndStats *indstats;
PVShared *shared;
- VacDeadItems *dead_items;
+ TidStore *dead_items;
+ void *area_space;
+ dsa_area *dead_items_area;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -998,10 +1010,10 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PARALLEL_VACUUM_KEY_INDEX_STATS,
false);
- /* Set dead_items space */
- dead_items = (VacDeadItems *) shm_toc_lookup(toc,
- PARALLEL_VACUUM_KEY_DEAD_ITEMS,
- false);
+ /* Set dead items */
+ area_space = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, false);
+ dead_items_area = dsa_attach_in_place(area_space, seg);
+ dead_items = tidstore_attach(dead_items_area, shared->dead_items_handle);
/* Set cost-based vacuum delay */
VacuumUpdateCosts();
@@ -1049,6 +1061,9 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
+ tidstore_detach(pvs.dead_items);
+ dsa_detach(dead_items_area);
+
/* Pop the error context stack */
error_context_stack = errcallback.previous;
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 53c8f8d79c..74915bee9b 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -3474,12 +3474,12 @@ check_autovacuum_work_mem(int *newval, void **extra, GucSource source)
return true;
/*
- * We clamp manually-set values to at least 1MB. Since
+ * We clamp manually-set values to at least 2MB. Since
* maintenance_work_mem is always set to at least this value, do the same
* here.
*/
- if (*newval < 1024)
- *newval = 1024;
+ if (*newval < 2048)
+ *newval = 2048;
return true;
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 55b3a04097..c223a7dc94 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -192,6 +192,8 @@ static const char *const BuiltinTrancheNames[] = {
"LogicalRepLauncherDSA",
/* LWTRANCHE_LAUNCHER_HASH: */
"LogicalRepLauncherHash",
+ /* LWTRANCHE_PARALLEL_VACUUM_DSA: */
+ "ParallelVacuumDSA",
};
StaticAssertDecl(lengthof(BuiltinTrancheNames) ==
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index cab3ddbe11..0bbdf04980 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2353,7 +2353,7 @@ struct config_int ConfigureNamesInt[] =
GUC_UNIT_KB
},
&maintenance_work_mem,
- 65536, 1024, MAX_KILOBYTES,
+ 65536, 2048, MAX_KILOBYTES,
NULL, NULL, NULL
},
diff --git a/src/include/commands/progress.h b/src/include/commands/progress.h
index e5add41352..b209d3cf84 100644
--- a/src/include/commands/progress.h
+++ b/src/include/commands/progress.h
@@ -23,8 +23,8 @@
#define PROGRESS_VACUUM_HEAP_BLKS_SCANNED 2
#define PROGRESS_VACUUM_HEAP_BLKS_VACUUMED 3
#define PROGRESS_VACUUM_NUM_INDEX_VACUUMS 4
-#define PROGRESS_VACUUM_MAX_DEAD_TUPLES 5
-#define PROGRESS_VACUUM_NUM_DEAD_TUPLES 6
+#define PROGRESS_VACUUM_MAX_DEAD_TUPLE_BYTES 5
+#define PROGRESS_VACUUM_DEAD_TUPLE_BYTES 6
/* Phases of vacuum (as advertised via PROGRESS_VACUUM_PHASE) */
#define PROGRESS_VACUUM_PHASE_SCAN_HEAP 1
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index 17e9b4f68e..b48c6ebf2d 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/genam.h"
#include "access/parallel.h"
+#include "access/tidstore.h"
#include "catalog/pg_class.h"
#include "catalog/pg_statistic.h"
#include "catalog/pg_type.h"
@@ -280,21 +281,6 @@ struct VacuumCutoffs
MultiXactId MultiXactCutoff;
};
-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;
-
-#define MAXDEADITEMS(avail_mem) \
- (((avail_mem) - offsetof(VacDeadItems, items)) / sizeof(ItemPointerData))
-
/* GUC parameters */
extern PGDLLIMPORT int default_statistics_target; /* PGDLLIMPORT for PostGIS */
extern PGDLLIMPORT int vacuum_freeze_min_age;
@@ -347,10 +333,9 @@ extern Relation vacuum_open_relation(Oid relid, RangeVar *relation,
LOCKMODE lmode);
extern IndexBulkDeleteResult *vac_bulkdel_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat,
- VacDeadItems *dead_items);
+ TidStore *dead_items);
extern IndexBulkDeleteResult *vac_cleanup_one_index(IndexVacuumInfo *ivinfo,
IndexBulkDeleteResult *istat);
-extern Size vac_max_items_to_alloc_size(int max_items);
/* In postmaster/autovacuum.c */
extern void AutoVacuumUpdateCostLimit(void);
@@ -359,10 +344,10 @@ extern void VacuumUpdateCosts(void);
/* in commands/vacuumparallel.c */
extern ParallelVacuumState *parallel_vacuum_init(Relation rel, Relation *indrels,
int nindexes, int nrequested_workers,
- int max_items, int elevel,
- BufferAccessStrategy bstrategy);
+ int vac_work_mem, int max_offset,
+ int elevel, BufferAccessStrategy bstrategy);
extern void parallel_vacuum_end(ParallelVacuumState *pvs, IndexBulkDeleteResult **istats);
-extern VacDeadItems *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
+extern TidStore *parallel_vacuum_get_dead_items(ParallelVacuumState *pvs);
extern void parallel_vacuum_bulkdel_all_indexes(ParallelVacuumState *pvs,
long num_table_tuples,
int num_index_scans);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 07002fdfbe..537b34b30c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -207,6 +207,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PGSTATS_DATA,
LWTRANCHE_LAUNCHER_DSA,
LWTRANCHE_LAUNCHER_HASH,
+ LWTRANCHE_PARALLEL_VACUUM_DSA,
LWTRANCHE_FIRST_USER_DEFINED
} BuiltinTrancheIds;
diff --git a/src/test/regress/expected/cluster.out b/src/test/regress/expected/cluster.out
index 2eec483eaa..e04f50726f 100644
--- a/src/test/regress/expected/cluster.out
+++ b/src/test/regress/expected/cluster.out
@@ -526,7 +526,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
-- ensure we don't use the index in CLUSTER nor the checking SELECTs
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/expected/create_index.out b/src/test/regress/expected/create_index.out
index acfd9d1f4f..d320ad87dd 100644
--- a/src/test/regress/expected/create_index.out
+++ b/src/test/regress/expected/create_index.out
@@ -1214,7 +1214,7 @@ DROP TABLE unlogged_hash_table;
-- CREATE INDEX hash_ovfl_index ON hash_ovfl_heap USING hash (x int4_ops);
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 919d947ec0..66d671a641 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2041,8 +2041,8 @@ pg_stat_progress_vacuum| SELECT s.pid,
s.param3 AS heap_blks_scanned,
s.param4 AS heap_blks_vacuumed,
s.param5 AS index_vacuum_count,
- s.param6 AS max_dead_tuples,
- s.param7 AS num_dead_tuples
+ s.param6 AS max_dead_tuple_bytes,
+ s.param7 AS dead_tuple_bytes
FROM (pg_stat_get_progress_info('VACUUM'::text) s(pid, datid, relid, param1, param2, param3, param4, param5, param6, param7, param8, param9, param10, param11, param12, param13, param14, param15, param16, param17, param18, param19, param20)
LEFT JOIN pg_database d ON ((s.datid = d.oid)));
pg_stat_recovery_prefetch| SELECT stats_reset,
diff --git a/src/test/regress/sql/cluster.sql b/src/test/regress/sql/cluster.sql
index a4cfaae807..a4cb5b98a5 100644
--- a/src/test/regress/sql/cluster.sql
+++ b/src/test/regress/sql/cluster.sql
@@ -258,7 +258,7 @@ create index cluster_sort on clstr_4 (hundred, thousand, tenthous);
set enable_indexscan = off;
-- Use external sort:
-set maintenance_work_mem = '1MB';
+set maintenance_work_mem = '2MB';
cluster clstr_4 using cluster_sort;
select * from
(select hundred, lag(hundred) over () as lhundred,
diff --git a/src/test/regress/sql/create_index.sql b/src/test/regress/sql/create_index.sql
index d49ce9f300..d6e2471b00 100644
--- a/src/test/regress/sql/create_index.sql
+++ b/src/test/regress/sql/create_index.sql
@@ -367,7 +367,7 @@ DROP TABLE unlogged_hash_table;
-- Test hash index build tuplesorting. Force hash tuplesort using low
-- maintenance_work_mem setting and fillfactor:
-SET maintenance_work_mem = '1MB';
+SET maintenance_work_mem = '2MB';
CREATE INDEX hash_tuplesort_idx ON tenk1 USING hash (stringu1 name_ops) WITH (fillfactor = 10);
EXPLAIN (COSTS OFF)
SELECT count(*) FROM tenk1 WHERE stringu1 = 'TVAAAA';
--
2.31.1
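Stepping back from the diff for a moment: the essence of the vacuumlazy.c change is that dead line pointers are now recorded per block during the heap scan, probed during index vacuuming, and then iterated per block again in the second heap pass. The following is a self-contained toy sketch of that call pattern only (DeadItems, dead_items_add and dead_items_lookup are stand-ins built on a fixed-size array; the tidstore_* functions in the patch are the real counterparts of these helpers):

#include <stdio.h>
#include <stdbool.h>
#include <string.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
typedef uint16_t OffsetNumber;

typedef struct DeadItemsBlock
{
    BlockNumber  blkno;
    int          num_offsets;
    OffsetNumber offsets[32];
} DeadItemsBlock;

typedef struct DeadItems
{
    int            nblocks;
    DeadItemsBlock blocks[128];
} DeadItems;

/* first heap pass: remember this block's dead line pointers
 * (analogue of tidstore_add_tids) */
static void
dead_items_add(DeadItems *di, BlockNumber blkno,
               const OffsetNumber *offs, int n)
{
    DeadItemsBlock *b = &di->blocks[di->nblocks++];

    b->blkno = blkno;
    b->num_offsets = n;
    memcpy(b->offsets, offs, sizeof(OffsetNumber) * n);
}

/* index vacuum callback: is this TID dead?
 * (analogue of tidstore_lookup_tid in vac_tid_reaped) */
static bool
dead_items_lookup(const DeadItems *di, BlockNumber blkno, OffsetNumber off)
{
    for (int i = 0; i < di->nblocks; i++)
    {
        if (di->blocks[i].blkno != blkno)
            continue;
        for (int j = 0; j < di->blocks[i].num_offsets; j++)
        {
            if (di->blocks[i].offsets[j] == off)
                return true;
        }
    }
    return false;
}

int
main(void)
{
    DeadItems    di = {0};
    OffsetNumber offs[] = {1, 4, 7};

    /* lazy_scan_heap: collect LP_DEAD offsets for block 10 */
    dead_items_add(&di, 10, offs, 3);

    /* lazy_vacuum_all_indexes: probe each index tuple's TID */
    printf("(10,4) dead? %d\n", dead_items_lookup(&di, 10, 4));

    /* lazy_vacuum_heap_rel: walk the store block by block
     * (analogue of tidstore_begin_iterate/tidstore_iterate_next) */
    for (int i = 0; i < di.nblocks; i++)
        printf("second pass: block %u, %d dead offsets\n",
               di.blocks[i].blkno, di.blocks[i].num_offsets);

    /* lazy_vacuum: forget everything for the next round
     * (analogue of tidstore_reset) */
    di.nblocks = 0;
    return 0;
}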
Attachment: v32-0005-Tool-for-measuring-radix-tree-and-tidstore-perfo.patch (application/octet-stream)
From cff1ffa9af592765cf9073291fb1665b09b61d8a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 16 Sep 2022 11:57:03 +0900
Subject: [PATCH v32 05/18] Tool for measuring radix tree and tidstore
performance
Includes Meson support, but commented out to avoid warnings
XXX: Not for commit
---
contrib/bench_radix_tree/Makefile | 21 +
.../bench_radix_tree--1.0.sql | 88 +++
contrib/bench_radix_tree/bench_radix_tree.c | 747 ++++++++++++++++++
.../bench_radix_tree/bench_radix_tree.control | 6 +
contrib/bench_radix_tree/expected/bench.out | 13 +
contrib/bench_radix_tree/meson.build | 33 +
contrib/bench_radix_tree/sql/bench.sql | 16 +
contrib/meson.build | 1 +
8 files changed, 925 insertions(+)
create mode 100644 contrib/bench_radix_tree/Makefile
create mode 100644 contrib/bench_radix_tree/bench_radix_tree--1.0.sql
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.c
create mode 100644 contrib/bench_radix_tree/bench_radix_tree.control
create mode 100644 contrib/bench_radix_tree/expected/bench.out
create mode 100644 contrib/bench_radix_tree/meson.build
create mode 100644 contrib/bench_radix_tree/sql/bench.sql
diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
new file mode 100644
index 0000000000..952bb0ceae
--- /dev/null
+++ b/contrib/bench_radix_tree/Makefile
@@ -0,0 +1,21 @@
+# contrib/bench_radix_tree/Makefile
+
+MODULE_big = bench_radix_tree
+OBJS = \
+ bench_radix_tree.o
+
+EXTENSION = bench_radix_tree
+DATA = bench_radix_tree--1.0.sql
+
+REGRESS = bench_fixed_height
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/bench_radix_tree
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
new file mode 100644
index 0000000000..ad66265e23
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -0,0 +1,88 @@
+/* contrib/bench_radix_tree/bench_radix_tree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION bench_radix_tree" to load this file. \quit
+
+create function bench_shuffle_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_seq_search(
+minblk int4,
+maxblk int4,
+random_block bool DEFAULT false,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT array_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT array_load_ms int8,
+OUT rt_search_ms int8,
+OUT array_serach_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_load_random_int(
+cnt int8,
+OUT mem_allocated int8,
+OUT load_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_search_random_nodes(
+cnt int8,
+filter_str text DEFAULT NULL,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_tidstore_load(
+minblk int4,
+maxblk int4,
+OUT mem_allocated int8,
+OUT load_ms int8,
+OUT iter_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
new file mode 100644
index 0000000000..6e5149e2c4
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -0,0 +1,747 @@
+/*-------------------------------------------------------------------------
+ *
+ * bench_radix_tree.c
+ *
+ * Copyright (c) 2016-2022, PostgreSQL Global Development Group
+ *
+ * contrib/bench_radix_tree/bench_radix_tree.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "lib/radixtree.h"
+#include <math.h>
+#include "miscadmin.h"
+#include "storage/lwlock.h"
+#include "utils/builtins.h"
+#include "utils/timestamp.h"
+
+PG_MODULE_MAGIC;
+
+/* run benchmark also for binary-search case? */
+/* #define MEASURE_BINARY_SEARCH 1 */
+
+#define TIDS_PER_BLOCK_FOR_LOAD 30
+#define TIDS_PER_BLOCK_FOR_LOOKUP 50
+
+//#define RT_DEBUG
+#define RT_PREFIX rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE uint64
+// WIP: compiles with warnings because rt_attach is defined but not used
+// #define RT_SHMEM
+#include "lib/radixtree.h"
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+
+PG_FUNCTION_INFO_V1(bench_seq_search);
+PG_FUNCTION_INFO_V1(bench_shuffle_search);
+PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
+PG_FUNCTION_INFO_V1(bench_tidstore_load);
+
+static uint64
+tid_to_key_off(ItemPointer tid, uint32 *off)
+{
+ uint64 upper;
+ uint32 shift = pg_ceil_log2_32(MaxHeapTuplesPerPage);
+ int64 tid_i;
+
+ Assert(ItemPointerGetOffsetNumber(tid) < MaxHeapTuplesPerPage);
+
+ tid_i = ItemPointerGetOffsetNumber(tid);
+ tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << shift;
+
+ /* log(sizeof(uint64) * BITS_PER_BYTE, 2) = log(64, 2) = 6 */
+ *off = tid_i & ((1 << 6) - 1);
+ upper = tid_i >> 6;
+ Assert(*off < (sizeof(uint64) * BITS_PER_BYTE));
+
+ Assert(*off < 64);
+
+ return upper;
+}
+
+static int
+shuffle_randrange(pg_prng_state *state, int lower, int upper)
+{
+ return (int) floor(pg_prng_double(state) * ((upper - lower) + 0.999999)) + lower;
+}
+
+/* Naive Fisher-Yates implementation*/
+static void
+shuffle_itemptrs(ItemPointer itemptr, uint64 nitems)
+{
+ /* reproducibility */
+ pg_prng_state state;
+
+ pg_prng_seed(&state, 0);
+
+ for (int i = 0; i < nitems - 1; i++)
+ {
+ int j = shuffle_randrange(&state, i, nitems - 1);
+ ItemPointerData t = itemptr[j];
+
+ itemptr[j] = itemptr[i];
+ itemptr[i] = t;
+ }
+}
+
+static ItemPointer
+generate_tids(BlockNumber minblk, BlockNumber maxblk, int ntids_per_blk, uint64 *ntids_p,
+ bool random_block)
+{
+ ItemPointer tids;
+ uint64 maxitems;
+ uint64 ntids = 0;
+ pg_prng_state state;
+
+ maxitems = (maxblk - minblk + 1) * ntids_per_blk;
+ tids = MemoryContextAllocHuge(TopTransactionContext,
+ sizeof(ItemPointerData) * maxitems);
+
+ if (random_block)
+ pg_prng_seed(&state, 0x9E3779B185EBCA87);
+
+ for (BlockNumber blk = minblk; blk < maxblk; blk++)
+ {
+ if (random_block && !pg_prng_bool(&state))
+ continue;
+
+ for (OffsetNumber off = FirstOffsetNumber;
+ off <= ntids_per_blk; off++)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ ItemPointerSetBlockNumber(&(tids[ntids]), blk);
+ ItemPointerSetOffsetNumber(&(tids[ntids]), off);
+
+ ntids++;
+ }
+ }
+
+ *ntids_p = ntids;
+ return tids;
+}
+
+#ifdef MEASURE_BINARY_SEARCH
+static int
+vac_cmp_itemptr(const void *left, const void *right)
+{
+ BlockNumber lblk,
+ rblk;
+ OffsetNumber loff,
+ roff;
+
+ lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+ rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+ if (lblk < rblk)
+ return -1;
+ if (lblk > rblk)
+ return 1;
+
+ loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+ roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+ if (loff < roff)
+ return -1;
+ if (loff > roff)
+ return 1;
+
+ return 0;
+}
+#endif
+
+Datum
+bench_tidstore_load(PG_FUNCTION_ARGS)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *result;
+ OffsetNumber *offs;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_ms;
+ int64 iter_ms;
+ TupleDesc tupdesc;
+ Datum values[3];
+ bool nulls[3] = {false};
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ offs = palloc(sizeof(OffsetNumber) * TIDS_PER_BLOCK_FOR_LOAD);
+ for (int i = 0; i < TIDS_PER_BLOCK_FOR_LOAD; i++)
+ offs[i] = i + 1; /* FirstOffsetNumber is 1 */
+
+ ts = tidstore_create(1 * 1024L * 1024L * 1024L, MaxHeapTuplesPerPage, NULL);
+
+ /* load tids */
+ start_time = GetCurrentTimestamp();
+ for (BlockNumber blkno = minblk; blkno < maxblk; blkno++)
+ tidstore_add_tids(ts, blkno, offs, TIDS_PER_BLOCK_FOR_LOAD);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ /* iterate through tids */
+ iter = tidstore_begin_iterate(ts);
+ start_time = GetCurrentTimestamp();
+ while ((result = tidstore_iterate_next(iter)) != NULL)
+ ;
+ tidstore_end_iterate(iter);
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ iter_ms = secs * 1000 + usecs / 1000;
+
+ values[0] = Int64GetDatum(tidstore_memory_usage(ts));
+ values[1] = Int64GetDatum(load_ms);
+ values[2] = Int64GetDatum(iter_ms);
+
+ tidstore_destroy(ts);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+static Datum
+bench_search(FunctionCallInfo fcinfo, bool shuffle)
+{
+ BlockNumber minblk = PG_GETARG_INT32(0);
+ BlockNumber maxblk = PG_GETARG_INT32(1);
+ bool random_block = PG_GETARG_BOOL(2);
+ rt_radix_tree *rt = NULL;
+ uint64 ntids;
+ uint64 key;
+ uint64 last_key = PG_UINT64_MAX;
+ uint64 val = 0;
+ ItemPointer tids;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[7];
+ bool nulls[7];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ tids = generate_tids(minblk, maxblk, TIDS_PER_BLOCK_FOR_LOAD, &ntids, random_block);
+
+ /* measure the load time of the radix tree */
+ rt = rt_create(CurrentMemoryContext);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ if (last_key != PG_UINT64_MAX && last_key != key)
+ {
+ rt_set(rt, last_key, &val);
+ val = 0;
+ }
+
+ last_key = key;
+ val |= (uint64) 1 << off;
+ }
+ if (last_key != PG_UINT64_MAX)
+ rt_set(rt, last_key, &val);
+
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ if (shuffle)
+ shuffle_itemptrs(tids, ntids);
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+	/* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ uint32 off;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ key = tid_to_key_off(tid, &off);
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_num_entries(rt));
+ values[1] = Int64GetDatum(rt_memory_usage(rt));
+ values[2] = Int64GetDatum(sizeof(ItemPointerData) * ntids);
+ values[3] = Int64GetDatum(rt_load_ms);
+ nulls[4] = true; /* ar_load_ms */
+ values[5] = Int64GetDatum(rt_search_ms);
+ nulls[6] = true; /* ar_search_ms */
+
+#ifdef MEASURE_BINARY_SEARCH
+ {
+ ItemPointer itemptrs = NULL;
+
+ int64 ar_load_ms,
+ ar_search_ms;
+
+ /* measure the load time of the array */
+ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
+ sizeof(ItemPointerData) * ntids);
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointerSetBlockNumber(&(itemptrs[i]),
+ ItemPointerGetBlockNumber(&(tids[i])));
+ ItemPointerSetOffsetNumber(&(itemptrs[i]),
+ ItemPointerGetOffsetNumber(&(tids[i])));
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_load_ms = secs * 1000 + usecs / 1000;
+
+		/* next, measure the search time of the array */
+ start_time = GetCurrentTimestamp();
+ for (int i = 0; i < ntids; i++)
+ {
+ ItemPointer tid = &(tids[i]);
+ volatile bool ret; /* prevent calling bsearch from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = bsearch((void *) tid,
+ (void *) itemptrs,
+ ntids,
+ sizeof(ItemPointerData),
+ vac_cmp_itemptr);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ ar_search_ms = secs * 1000 + usecs / 1000;
+
+ /* set the result */
+ nulls[4] = false;
+ values[4] = Int64GetDatum(ar_load_ms);
+ nulls[6] = false;
+ values[6] = Int64GetDatum(ar_search_ms);
+ }
+#endif
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_seq_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, false);
+}
+
+Datum
+bench_shuffle_search(PG_FUNCTION_ARGS)
+{
+ return bench_search(fcinfo, true);
+}
+
+Datum
+bench_load_random_int(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ pg_prng_state state;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ Datum values[2];
+ bool nulls[2];
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ pg_prng_seed(&state, 0);
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 key = pg_prng_uint64(&state);
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+ x ^= x >> 30;
+ x *= UINT64CONST(0xbf58476d1ce4e5b9);
+ x ^= x >> 27;
+ x *= UINT64CONST(0x94d049bb133111eb);
+ x ^= x >> 31;
+ return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+ uint64 cnt = (uint64) PG_GETARG_INT64(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 load_time_ms;
+ int64 search_time_ms;
+ Datum values[3] = {0};
+ bool nulls[3] = {0};
+ /* from trial and error */
+ uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ if (!PG_ARGISNULL(1))
+ {
+ char *filter_str = text_to_cstring(PG_GETARG_TEXT_P(1));
+
+ if (sscanf(filter_str, "0x%lX", &filter) == 0)
+ elog(ERROR, "invalid filter string %s", filter_str);
+ }
+ elog(NOTICE, "bench with filter 0x%lX", filter);
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+
+ rt_set(rt, key, &key);
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ load_time_ms = secs * 1000 + usecs / 1000;
+
+ elog(NOTICE, "sleeping for 2 seconds...");
+ pg_usleep(2 * 1000000L);
+
+ start_time = GetCurrentTimestamp();
+ for (uint64 i = 0; i < cnt; i++)
+ {
+ uint64 hash = hash64(i);
+ uint64 key = hash & filter;
+ uint64 val;
+ volatile bool ret; /* prevent calling rt_search from being
+ * optimized out */
+
+ CHECK_FOR_INTERRUPTS();
+
+ ret = rt_search(rt, key, &val);
+ (void) ret;
+ }
+ end_time = GetCurrentTimestamp();
+
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ search_time_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ values[0] = Int64GetDatum(rt_memory_usage(rt));
+ values[1] = Int64GetDatum(load_time_ms);
+ values[2] = Int64GetDatum(search_time_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_fixed_height_search(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_load_ms,
+ rt_search_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ /* test boundary between vector and iteration */
+ const int n_keys = 5 * 16 * 16 * 16 * 16;
+ uint64 r,
+ h,
+ i,
+ j,
+ k;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+
+ /*
+ * lower nodes have limited fanout, the top is only limited by
+ * bits-per-byte
+ */
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_set;
+
+ rt_set(rt, key, &key_id);
+ }
+ }
+ }
+ }
+ }
+finish_set:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_load_ms = secs * 1000 + usecs / 1000;
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+	/* measure the search time of the radix tree */
+ start_time = GetCurrentTimestamp();
+
+ key_id = 0;
+ for (r = 1;; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ for (i = 1; i <= fanout; i++)
+ {
+ for (j = 1; j <= fanout; j++)
+ {
+ for (k = 1; k <= fanout; k++)
+ {
+ uint64 key,
+ val;
+
+ key = (r << 32) | (h << 24) | (i << 16) | (j << 8) | (k);
+
+ CHECK_FOR_INTERRUPTS();
+
+ key_id++;
+ if (key_id > n_keys)
+ goto finish_search;
+
+ rt_search(rt, key, &val);
+ }
+ }
+ }
+ }
+ }
+finish_search:
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_search_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_load_ms);
+ values[4] = Int64GetDatum(rt_search_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+ int fanout = PG_GETARG_INT32(0);
+ rt_radix_tree *rt;
+ TupleDesc tupdesc;
+ TimestampTz start_time,
+ end_time;
+ long secs;
+ int usecs;
+ int64 rt_sparseload_ms;
+ Datum values[5];
+ bool nulls[5];
+
+ uint64 r,
+ h;
+ uint64 key_id;
+
+ /* Build a tuple descriptor for our result type */
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ rt = rt_create(CurrentMemoryContext);
+
+ key_id = 0;
+
+ for (r = 1; r <= fanout; r++)
+ {
+ for (h = 1; h <= fanout; h++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (h);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+
+#ifdef RT_DEBUG
+ rt_stats(rt);
+#endif
+
+ /* measure sparse deletion and re-loading */
+ start_time = GetCurrentTimestamp();
+
+ for (int t = 0; t<10000; t++)
+ {
+ /* delete one key in each leaf */
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ rt_delete(rt, key);
+ }
+
+ /* add them all back */
+ key_id = 0;
+ for (r = 1; r <= fanout; r++)
+ {
+ uint64 key;
+
+ key = (r << 8) | (fanout);
+
+ key_id++;
+ rt_set(rt, key, &key_id);
+ }
+ }
+ end_time = GetCurrentTimestamp();
+ TimestampDifference(start_time, end_time, &secs, &usecs);
+ rt_sparseload_ms = secs * 1000 + usecs / 1000;
+
+ MemSet(nulls, false, sizeof(nulls));
+ values[0] = Int32GetDatum(fanout);
+ values[1] = Int64GetDatum(rt_num_entries(rt));
+ values[2] = Int64GetDatum(rt_memory_usage(rt));
+ values[3] = Int64GetDatum(rt_sparseload_ms);
+
+ rt_free(rt);
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/* to silence warnings about unused iter functions */
+static void pg_attribute_unused()
+stub_iter()
+{
+ rt_radix_tree *rt;
+ rt_iter *iter;
+ uint64 key = 1;
+ uint64 value = 1;
+
+ rt = rt_create(CurrentMemoryContext);
+
+ iter = rt_begin_iterate(rt);
+ rt_iterate_next(iter, &key, &value);
+ rt_end_iterate(iter);
+}
\ No newline at end of file
diff --git a/contrib/bench_radix_tree/bench_radix_tree.control b/contrib/bench_radix_tree/bench_radix_tree.control
new file mode 100644
index 0000000000..1d988e6c9a
--- /dev/null
+++ b/contrib/bench_radix_tree/bench_radix_tree.control
@@ -0,0 +1,6 @@
+# bench_radix_tree extension
+comment = 'benchmark suite for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/bench_radix_tree'
+relocatable = true
+trusted = true
diff --git a/contrib/bench_radix_tree/expected/bench.out b/contrib/bench_radix_tree/expected/bench.out
new file mode 100644
index 0000000000..60c303892e
--- /dev/null
+++ b/contrib/bench_radix_tree/expected/bench.out
@@ -0,0 +1,13 @@
+create extension bench_radix_tree;
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/bench_radix_tree/meson.build b/contrib/bench_radix_tree/meson.build
new file mode 100644
index 0000000000..332c1ae7df
--- /dev/null
+++ b/contrib/bench_radix_tree/meson.build
@@ -0,0 +1,33 @@
+bench_radix_tree_sources = files(
+ 'bench_radix_tree.c',
+)
+
+if host_system == 'windows'
+ bench_radix_tree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'bench_radix_tree',
+ '--FILEDESC', 'bench_radix_tree - performance test code for radix tree',])
+endif
+
+bench_radix_tree = shared_module('bench_radix_tree',
+ bench_radix_tree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += bench_radix_tree
+
+install_data(
+ 'bench_radix_tree.control',
+ 'bench_radix_tree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'bench_radix_tree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'bench_radix_tree',
+ ],
+ },
+}
diff --git a/contrib/bench_radix_tree/sql/bench.sql b/contrib/bench_radix_tree/sql/bench.sql
new file mode 100644
index 0000000000..a46018c9d4
--- /dev/null
+++ b/contrib/bench_radix_tree/sql/bench.sql
@@ -0,0 +1,16 @@
+create extension bench_radix_tree;
+
+\o seq_search.data
+begin;
+select * from bench_seq_search(0, 1000000);
+commit;
+
+\o shuffle_search.data
+begin;
+select * from bench_shuffle_search(0, 1000000);
+commit;
+
+\o random_load.data
+begin;
+select * from bench_load_random_int(10000000);
+commit;
diff --git a/contrib/meson.build b/contrib/meson.build
index bd4a57c43c..421d469f8c 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -12,6 +12,7 @@ subdir('amcheck')
subdir('auth_delay')
subdir('auto_explain')
subdir('basic_archive')
+subdir('bench_radix_tree')
subdir('bloom')
subdir('basebackup_to_shell')
subdir('bool_plperl')
--
2.31.1
v32-0004-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch
From a804e3ebba8733d65497d5e9c3a47b32f175ea1e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Fri, 23 Dec 2022 23:50:39 +0900
Subject: [PATCH v32 04/18] Add TIDStore, to store sets of TIDs
(ItemPointerData) efficiently.
The TIDStore is designed to store large sets of TIDs efficiently, and
is backed by the radix tree. A TID is encoded into a 64-bit key and
value and inserted into the radix tree.
The TIDStore is not used for anything yet, aside from the test code,
but the follow-up patch integrates the TIDStore with lazy vacuum,
reducing lazy vacuum memory usage and lifting the 1GB limit on its
size, by storing the list of dead TIDs more efficiently.
This includes a unit test module, in src/test/modules/test_tidstore.
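For illustration (this sketch is not part of the patch), a backend-local
caller of the API declared in src/include/access/tidstore.h below might look
like the following; the memory limit and block/offset values are arbitrary,
and the offsets passed to tidstore_add_tids() must be sorted in ascending
order:

TidStore    *ts = tidstore_create(64 * 1024 * 1024, MaxHeapTuplesPerPage, NULL);
OffsetNumber offs[] = {1, 5, 10};   /* must be in ascending order */
ItemPointerData tid;

/* record three dead item pointers on block 42 */
tidstore_add_tids(ts, (BlockNumber) 42, offs, lengthof(offs));

/* membership check, as an index-vacuum callback would do */
ItemPointerSet(&tid, 42, 5);
Assert(tidstore_lookup_tid(ts, &tid));

tidstore_destroy(ts);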
---
doc/src/sgml/monitoring.sgml | 4 +
src/backend/access/common/Makefile | 1 +
src/backend/access/common/meson.build | 1 +
src/backend/access/common/tidstore.c | 681 ++++++++++++++++++
src/backend/storage/lmgr/lwlock.c | 2 +
src/include/access/tidstore.h | 49 ++
src/include/storage/lwlock.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_tidstore/Makefile | 23 +
.../test_tidstore/expected/test_tidstore.out | 13 +
src/test/modules/test_tidstore/meson.build | 35 +
.../test_tidstore/sql/test_tidstore.sql | 7 +
.../test_tidstore/test_tidstore--1.0.sql | 8 +
.../modules/test_tidstore/test_tidstore.c | 226 ++++++
.../test_tidstore/test_tidstore.control | 4 +
16 files changed, 1057 insertions(+)
create mode 100644 src/backend/access/common/tidstore.c
create mode 100644 src/include/access/tidstore.h
create mode 100644 src/test/modules/test_tidstore/Makefile
create mode 100644 src/test/modules/test_tidstore/expected/test_tidstore.out
create mode 100644 src/test/modules/test_tidstore/meson.build
create mode 100644 src/test/modules/test_tidstore/sql/test_tidstore.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore--1.0.sql
create mode 100644 src/test/modules/test_tidstore/test_tidstore.c
create mode 100644 src/test/modules/test_tidstore/test_tidstore.control
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 2903b67170..be4448fe6e 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2211,6 +2211,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting to access a shared TID bitmap during a parallel bitmap
index scan.</entry>
</row>
+ <row>
+ <entry><literal>SharedTidStore</literal></entry>
+ <entry>Waiting to access a shared TID store.</entry>
+ </row>
<row>
<entry><literal>SharedTupleStore</literal></entry>
<entry>Waiting to access a shared tuple store during parallel
diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile
index b9aff0ccfd..67b8cc6108 100644
--- a/src/backend/access/common/Makefile
+++ b/src/backend/access/common/Makefile
@@ -27,6 +27,7 @@ OBJS = \
syncscan.o \
toast_compression.o \
toast_internals.o \
+ tidstore.o \
tupconvert.o \
tupdesc.o
diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build
index f5ac17b498..fce19c09ce 100644
--- a/src/backend/access/common/meson.build
+++ b/src/backend/access/common/meson.build
@@ -15,6 +15,7 @@ backend_sources += files(
'syncscan.c',
'toast_compression.c',
'toast_internals.c',
+ 'tidstore.c',
'tupconvert.c',
'tupdesc.c',
)
diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
new file mode 100644
index 0000000000..8c05e60d92
--- /dev/null
+++ b/src/backend/access/common/tidstore.c
@@ -0,0 +1,681 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.c
+ * Tid (ItemPointerData) storage implementation.
+ *
+ * This module provides an in-memory data structure to store Tids (ItemPointer).
+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, and
+ * stored in the radix tree.
+ *
+ * A TidStore can be shared among parallel worker processes by passing a DSA
+ * area to tidstore_create(). Other backends can attach to the shared TidStore
+ * by tidstore_attach().
+ *
+ * As for concurrency, we basically rely on the concurrency support in the
+ * radix tree, but we acquire the lock on a TidStore in some cases, for
+ * example, when resetting the store and when accessing the number of tids
+ * in the store (num_tids).
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/access/common/tidstore.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/tidstore.h"
+#include "miscadmin.h"
+#include "port/pg_bitutils.h"
+#include "storage/lwlock.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/*
+ * For encoding purposes, tids are represented as a pair of 64-bit key and
+ * 64-bit value. First, we construct a 64-bit unsigned integer by combining
+ * the block number and the offset number. The number of bits used for the
+ * offset number is specified by max_offset in tidstore_create(). We are
+ * frugal with the bits, because smaller keys could help keep the radix
+ * tree shallow.
+ *
+ * For example, a tid of heap with 8kB blocks uses the lowest 9 bits for
+ * the offset number and uses the next 32 bits for the block number. That
+ * is, only 41 bits are used:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ *
+ * X = bits used for offset number
+ * Y = bits used for block number
+ * u = unused bit
+ * (high on the left, low on the right)
+ *
+ * 9 bits are enough for the offset number, because MaxHeapTuplesPerPage < 2^9
+ * on 8kB blocks.
+ *
+ * The 64-bit value is the bitmap representation of the lowest 6 bits
+ * (TIDSTORE_VALUE_NBITS) of the integer, and the remaining 35 bits are used
+ * as the key:
+ *
+ * uuuuuuuY YYYYYYYY YYYYYYYY YYYYYYYY YYYYYYYX XXXXXXXX
+ * |----| value
+ * |---------------------------------------------| key
+ *
+ * The maximum height of the radix tree is 5 in this case.
+ */
+#define TIDSTORE_VALUE_NBITS 6 /* log(64, 2) */
+#define TIDSTORE_OFFSET_MASK ((1 << TIDSTORE_VALUE_NBITS) - 1)
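/*
 * Worked example (illustrative, not part of the original patch): with 8kB
 * heap blocks offset_nbits is 9, so for the tid (block 1000, offset 7):
 *
 *   tid_i = 7 | (1000 << 9) = 512007
 *   key   = tid_i >> 6      = 8000
 *   value = UINT64CONST(1) << (7 & TIDSTORE_OFFSET_MASK), i.e. bit 7 is set
 *
 * Offsets 1..63 of block 1000 all map to key 8000 and share one 64-bit
 * bitmap value; offset 64 starts key 8001.
 */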
+
+/* A magic value used to identify our TidStores. */
+#define TIDSTORE_MAGIC 0x826f6a10
+
+#define RT_PREFIX local_rt
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+#define RT_PREFIX shared_rt
+#define RT_SHMEM
+#define RT_SCOPE static
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_VALUE_TYPE uint64
+#include "lib/radixtree.h"
+
+/* The control object for a TidStore */
+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* These values are never changed after creation */
+ size_t max_bytes; /* the maximum bytes a TidStore can use */
+ int max_offset; /* the maximum offset number */
+ int offset_nbits; /* the number of bits required for an offset
+ * number */
+ int offset_key_nbits; /* the number of bits of an offset number
+ * used in a key */
+
+ /* The below fields are used only in shared case */
+
+ uint32 magic;
+ LWLock lock;
+
+ /* handles for TidStore and radix tree */
+ tidstore_handle handle;
+ shared_rt_handle tree_handle;
+} TidStoreControl;
+
+/* Per-backend state for a TidStore */
+struct TidStore
+{
+ /*
+ * Control object. This is allocated in DSA area 'area' in the shared
+ * case, otherwise in backend-local memory.
+ */
+ TidStoreControl *control;
+
+ /* Storage for Tids. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ local_rt_radix_tree *local;
+ shared_rt_radix_tree *shared;
+ } tree;
+
+ /* DSA area for TidStore if used */
+ dsa_area *area;
+};
+#define TidStoreIsShared(ts) ((ts)->area != NULL)
+
+/* Iterator for TidStore */
+typedef struct TidStoreIter
+{
+ TidStore *ts;
+
+ /* iterator of radix tree. Use either one depending on TidStoreIsShared() */
+ union
+ {
+ shared_rt_iter *shared;
+ local_rt_iter *local;
+ } tree_iter;
+
+	/* have we returned all tids? */
+ bool finished;
+
+ /* save for the next iteration */
+ uint64 next_key;
+ uint64 next_val;
+
+ /* output for the caller */
+ TidStoreIterResult result;
+} TidStoreIter;
+
+static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
+static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit);
+static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit);
+
+/*
+ * Create a TidStore. The returned object is allocated in backend-local memory.
+ * The radix tree for storage is allocated in the DSA area if 'area' is non-NULL.
+ */
+TidStore *
+tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
+{
+ TidStore *ts;
+
+ ts = palloc0(sizeof(TidStore));
+
+ /*
+ * Create the radix tree for the main storage.
+ *
+	 * Memory consumption depends not only on the number of stored tids, but
+	 * also on their distribution, on how the radix tree stores them, and on
+	 * the memory management that backs the radix tree. The maximum number of
+	 * bytes a TidStore can use is specified by max_bytes in
+	 * tidstore_create(). We want the total memory consumption of a TidStore
+	 * not to exceed max_bytes.
+	 *
+	 * In the local TidStore case, the radix tree uses a slab allocator for
+	 * each kind of node class. The most memory-consuming case while adding
+	 * tids associated with one page (i.e. during tidstore_add_tids()) is
+	 * allocating a new slab block for a new radix tree node, which is
+	 * approximately 70kB. Therefore, we deduct 70kB from max_bytes.
+	 *
+	 * In the shared case, DSA allocates memory segments big enough to follow
+	 * a geometric series that approximately doubles the total DSA size (see
+	 * make_new_segment() in dsa.c). We simulated how DSA increases the
+	 * segment size, and the simulation revealed that a 75% threshold for the
+	 * maximum bytes works perfectly when max_bytes is a power of two, and a
+	 * 60% threshold works for other cases.
+ */
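	/*
	 * For example (illustrative numbers): with max_bytes = 1GB (a power of
	 * two) the shared-case limit becomes 0.75 * 1GB = 768MB, while
	 * max_bytes = 1.5GB gives 0.6 * 1.5GB = ~920MB; in the local case,
	 * 1GB gives 1GB minus 70kB.
	 */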
+ if (area != NULL)
+ {
+ dsa_pointer dp;
+ float ratio = ((max_bytes & (max_bytes - 1)) == 0) ? 0.75 : 0.6;
+
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ dp = dsa_allocate0(area, sizeof(TidStoreControl));
+ ts->control = (TidStoreControl *) dsa_get_address(area, dp);
+ ts->control->max_bytes = (uint64) (max_bytes * ratio);
+ ts->area = area;
+
+ ts->control->magic = TIDSTORE_MAGIC;
+ LWLockInitialize(&ts->control->lock, LWTRANCHE_SHARED_TIDSTORE);
+ ts->control->handle = dp;
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+ }
+ else
+ {
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ ts->control = (TidStoreControl *) palloc0(sizeof(TidStoreControl));
+ ts->control->max_bytes = max_bytes - (70 * 1024);
+ }
+
+ ts->control->max_offset = max_offset;
+ ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
+
+ if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
+ ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+
+ ts->control->offset_key_nbits =
+ ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
+
+ return ts;
+}
+
+/*
+ * Attach to the shared TidStore using a handle. The returned object is
+ * allocated in backend-local memory using the CurrentMemoryContext.
+ */
+TidStore *
+tidstore_attach(dsa_area *area, tidstore_handle handle)
+{
+ TidStore *ts;
+ dsa_pointer control;
+
+ Assert(area != NULL);
+ Assert(DsaPointerIsValid(handle));
+
+ /* create per-backend state */
+ ts = palloc0(sizeof(TidStore));
+
+ /* Find the control object in shared memory */
+ control = handle;
+
+ /* Set up the TidStore */
+ ts->control = (TidStoreControl *) dsa_get_address(area, control);
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
+ ts->area = area;
+
+ return ts;
+}
+
+/*
+ * Detach from a TidStore. This detaches from radix tree and frees the
+ * backend-local resources. The radix tree will continue to exist until
+ * it is either explicitly destroyed, or the area that backs it is returned
+ * to the operating system.
+ */
+void
+tidstore_detach(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ shared_rt_detach(ts->tree.shared);
+ pfree(ts);
+}
+
+/*
+ * Destroy a TidStore, returning all memory.
+ *
+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backends must
+ * explicitly call tidstore_detach to free up backend-local memory associated
+ * with the TidStore. The backend that calls tidstore_destroy must not call
+ * tidstore_detach.
+ */
+void
+tidstore_destroy(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * Vandalize the control block to help catch programming error where
+ * other backends access the memory formerly occupied by this radix
+ * tree.
+ */
+ ts->control->magic = 0;
+ dsa_free(ts->area, ts->control->handle);
+ shared_rt_free(ts->tree.shared);
+ }
+ else
+ {
+ pfree(ts->control);
+ local_rt_free(ts->tree.local);
+ }
+
+ pfree(ts);
+}
+
+/*
+ * Forget all collected tids. This is similar to tidstore_destroy, but instead
+ * of freeing the entire TidStore we recreate only the radix tree storage.
+ */
+void
+tidstore_reset(TidStore *ts)
+{
+ if (TidStoreIsShared(ts))
+ {
+ Assert(ts->control->magic == TIDSTORE_MAGIC);
+
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /*
+ * Free the radix tree and return allocated DSA segments to
+ * the operating system.
+ */
+ shared_rt_free(ts->tree.shared);
+ dsa_trim(ts->area);
+
+ /* Recreate the radix tree */
+ ts->tree.shared = shared_rt_create(CurrentMemoryContext, ts->area,
+ LWTRANCHE_SHARED_TIDSTORE);
+
+ /* update the radix tree handle as we recreated it */
+ ts->control->tree_handle = shared_rt_get_handle(ts->tree.shared);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+
+ LWLockRelease(&ts->control->lock);
+ }
+ else
+ {
+ local_rt_free(ts->tree.local);
+ ts->tree.local = local_rt_create(CurrentMemoryContext);
+
+ /* Reset the statistics */
+ ts->control->num_tids = 0;
+ }
+}
+
+/* Add Tids on a block to TidStore */
+void
+tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ uint64 *values;
+ uint64 key;
+ uint64 prev_key;
+ uint64 off_bitmap = 0;
+ int idx;
+ const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+ const int nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ values = palloc(sizeof(uint64) * nkeys);
+ key = prev_key = key_base;
+
+ for (int i = 0; i < num_offsets; i++)
+ {
+ uint64 off_bit;
+
+ /* encode the tid to a key and partial offset */
+ key = encode_key_off(ts, blkno, offsets[i], &off_bit);
+
+ /* make sure we scanned the line pointer array in order */
+ Assert(key >= prev_key);
+
+ if (key > prev_key)
+ {
+ idx = prev_key - key_base;
+ Assert(idx >= 0 && idx < nkeys);
+
+ /* write out offset bitmap for this key */
+ values[idx] = off_bitmap;
+
+ /* zero out any gaps up to the current key */
+ for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
+ values[empty_idx] = 0;
+
+ /* reset for current key -- the current offset will be handled below */
+ off_bitmap = 0;
+ prev_key = key;
+ }
+
+ off_bitmap |= off_bit;
+ }
+
+ /* save the final index for later */
+ idx = key - key_base;
+ /* write out last offset bitmap */
+ values[idx] = off_bitmap;
+
+ if (TidStoreIsShared(ts))
+ LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
+
+ /* insert the calculated key-values to the tree */
+ for (int i = 0; i <= idx; i++)
+ {
+ if (values[i])
+ {
+ key = key_base + i;
+
+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, key, &values[i]);
+ else
+ local_rt_set(ts->tree.local, key, &values[i]);
+ }
+ }
+
+ /* update statistics */
+ ts->control->num_tids += num_offsets;
+
+ if (TidStoreIsShared(ts))
+ LWLockRelease(&ts->control->lock);
+
+ pfree(values);
+}
+
+/* Return true if the given tid is present in the TidStore */
+bool
+tidstore_lookup_tid(TidStore *ts, ItemPointer tid)
+{
+ uint64 key;
+ uint64 val = 0;
+ uint64 off_bit;
+ bool found;
+
+ key = tid_to_key_off(ts, tid, &off_bit);
+
+ if (TidStoreIsShared(ts))
+ found = shared_rt_search(ts->tree.shared, key, &val);
+ else
+ found = local_rt_search(ts->tree.local, key, &val);
+
+ if (!found)
+ return false;
+
+ return (val & off_bit) != 0;
+}
+
+/*
+ * Prepare to iterate through a TidStore. Since the radix tree is locked during
+ * the iteration, tidstore_end_iterate() needs to be called when finished.
+ *
+ * Concurrent updates during the iteration will be blocked when inserting a
+ * key-value pair into the radix tree.
+ */
+TidStoreIter *
+tidstore_begin_iterate(TidStore *ts)
+{
+ TidStoreIter *iter;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ iter = palloc0(sizeof(TidStoreIter));
+ iter->ts = ts;
+
+ iter->result.blkno = InvalidBlockNumber;
+ iter->result.offsets = palloc(sizeof(OffsetNumber) * ts->control->max_offset);
+
+ if (TidStoreIsShared(ts))
+ iter->tree_iter.shared = shared_rt_begin_iterate(ts->tree.shared);
+ else
+ iter->tree_iter.local = local_rt_begin_iterate(ts->tree.local);
+
+	/* If the TidStore is empty, there is nothing to iterate */
+ if (tidstore_num_tids(ts) == 0)
+ iter->finished = true;
+
+ return iter;
+}
+
+static inline bool
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key, uint64 *val)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key, val);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key, val);
+}
+
+/*
+ * Scan the TidStore and return a pointer to TidStoreIterResult that has tids
+ * in one block. We return the block numbers in ascending order and the offset
+ * numbers in each result are also sorted in ascending order.
+ */
+TidStoreIterResult *
+tidstore_iterate_next(TidStoreIter *iter)
+{
+ uint64 key;
+ uint64 val;
+ TidStoreIterResult *result = &(iter->result);
+
+ if (iter->finished)
+ return NULL;
+
+ if (BlockNumberIsValid(result->blkno))
+ {
+ /* Process the previously collected key-value */
+ result->num_offsets = 0;
+ tidstore_iter_extract_tids(iter, iter->next_key, iter->next_val);
+ }
+
+ while (tidstore_iter_kv(iter, &key, &val))
+ {
+ BlockNumber blkno;
+
+ blkno = key_get_blkno(iter->ts, key);
+
+ if (BlockNumberIsValid(result->blkno) && result->blkno != blkno)
+ {
+ /*
+ * We got a key-value pair for a different block. So return the
+ * collected tids, and remember the key-value for the next iteration.
+ */
+ iter->next_key = key;
+ iter->next_val = val;
+ return result;
+ }
+
+ /* Collect tids extracted from the key-value pair */
+ tidstore_iter_extract_tids(iter, key, val);
+ }
+
+ iter->finished = true;
+ return result;
+}
+
+/*
+ * Finish an iteration over TidStore. This needs to be called after finishing
+ * an iteration, or when exiting one early.
+ */
+void
+tidstore_end_iterate(TidStoreIter *iter)
+{
+ if (TidStoreIsShared(iter->ts))
+ shared_rt_end_iterate(iter->tree_iter.shared);
+ else
+ local_rt_end_iterate(iter->tree_iter.local);
+
+ pfree(iter->result.offsets);
+ pfree(iter);
+}
+
+/* Return the number of tids we collected so far */
+int64
+tidstore_num_tids(TidStore *ts)
+{
+ uint64 num_tids;
+
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ if (!TidStoreIsShared(ts))
+ return ts->control->num_tids;
+
+ LWLockAcquire(&ts->control->lock, LW_SHARED);
+ num_tids = ts->control->num_tids;
+ LWLockRelease(&ts->control->lock);
+
+ return num_tids;
+}
+
+/* Return true if the current memory usage of TidStore exceeds the limit */
+bool
+tidstore_is_full(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return (tidstore_memory_usage(ts) > ts->control->max_bytes);
+}
+
+/* Return the maximum memory TidStore can use */
+size_t
+tidstore_max_memory(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->max_bytes;
+}
+
+/* Return the memory usage of TidStore */
+size_t
+tidstore_memory_usage(TidStore *ts)
+{
+ Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
+
+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+	return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local);
+}
+
+/*
+ * Get a handle that can be used by other processes to attach to this TidStore
+ */
+tidstore_handle
+tidstore_get_handle(TidStore *ts)
+{
+ Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
+
+ return ts->control->handle;
+}
+
+/* Extract tids from the given key-value pair */
+static void
+tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
+{
+ TidStoreIterResult *result = (&iter->result);
+
+ while (val)
+ {
+ uint64 tid_i;
+ OffsetNumber off;
+
+ tid_i = key << TIDSTORE_VALUE_NBITS;
+ tid_i |= pg_rightmost_one_pos64(val);
+
+ off = tid_i & ((UINT64CONST(1) << iter->ts->control->offset_nbits) - 1);
+
+ Assert(result->num_offsets < iter->ts->control->max_offset);
+ result->offsets[result->num_offsets++] = off;
+
+ /* unset the rightmost bit */
+ val &= ~pg_rightmost_one64(val);
+ }
+
+ result->blkno = key_get_blkno(iter->ts, key);
+}
+
+/* Get block number from the given key */
+static inline BlockNumber
+key_get_blkno(TidStore *ts, uint64 key)
+{
+ return (BlockNumber) (key >> ts->control->offset_key_nbits);
+}
+
+/* Encode a tid to key and offset */
+static inline uint64
+tid_to_key_off(TidStore *ts, ItemPointer tid, uint64 *off_bit)
+{
+ uint32 offset = ItemPointerGetOffsetNumber(tid);
+ BlockNumber block = ItemPointerGetBlockNumber(tid);
+
+ return encode_key_off(ts, block, offset, off_bit);
+}
+
+/* encode a block and offset to a key and partial offset */
+static inline uint64
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint64 *off_bit)
+{
+ uint64 key;
+ uint64 tid_i;
+ uint32 off_lower;
+
+ off_lower = offset & TIDSTORE_OFFSET_MASK;
+ Assert(off_lower < (sizeof(uint64) * BITS_PER_BYTE));
+
+ *off_bit = UINT64CONST(1) << off_lower;
+ tid_i = offset | ((uint64) block << ts->control->offset_nbits);
+ key = tid_i >> TIDSTORE_VALUE_NBITS;
+
+ return key;
+}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d2ec396045..55b3a04097 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -176,6 +176,8 @@ static const char *const BuiltinTrancheNames[] = {
"SharedTupleStore",
/* LWTRANCHE_SHARED_TIDBITMAP: */
"SharedTidBitmap",
+ /* LWTRANCHE_SHARED_TIDSTORE: */
+ "SharedTidStore",
/* LWTRANCHE_PARALLEL_APPEND: */
"ParallelAppend",
/* LWTRANCHE_PER_XACT_PREDICATE_LIST: */
diff --git a/src/include/access/tidstore.h b/src/include/access/tidstore.h
new file mode 100644
index 0000000000..a35a52124a
--- /dev/null
+++ b/src/include/access/tidstore.h
@@ -0,0 +1,49 @@
+/*-------------------------------------------------------------------------
+ *
+ * tidstore.h
+ * Tid storage.
+ *
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/tidstore.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef TIDSTORE_H
+#define TIDSTORE_H
+
+#include "storage/itemptr.h"
+#include "utils/dsa.h"
+
+typedef dsa_pointer tidstore_handle;
+
+typedef struct TidStore TidStore;
+typedef struct TidStoreIter TidStoreIter;
+
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ OffsetNumber *offsets;
+ int num_offsets;
+} TidStoreIterResult;
+
+extern TidStore *tidstore_create(size_t max_bytes, int max_offset, dsa_area *dsa);
+extern TidStore *tidstore_attach(dsa_area *dsa, dsa_pointer handle);
+extern void tidstore_detach(TidStore *ts);
+extern void tidstore_destroy(TidStore *ts);
+extern void tidstore_reset(TidStore *ts);
+extern void tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets);
+extern bool tidstore_lookup_tid(TidStore *ts, ItemPointer tid);
+extern TidStoreIter * tidstore_begin_iterate(TidStore *ts);
+extern TidStoreIterResult *tidstore_iterate_next(TidStoreIter *iter);
+extern void tidstore_end_iterate(TidStoreIter *iter);
+extern int64 tidstore_num_tids(TidStore *ts);
+extern bool tidstore_is_full(TidStore *ts);
+extern size_t tidstore_max_memory(TidStore *ts);
+extern size_t tidstore_memory_usage(TidStore *ts);
+extern tidstore_handle tidstore_get_handle(TidStore *ts);
+
+#endif /* TIDSTORE_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index d2c7afb8f4..07002fdfbe 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -199,6 +199,7 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_PER_SESSION_RECORD_TYPMOD,
LWTRANCHE_SHARED_TUPLESTORE,
LWTRANCHE_SHARED_TIDBITMAP,
+ LWTRANCHE_SHARED_TIDSTORE,
LWTRANCHE_PARALLEL_APPEND,
LWTRANCHE_PER_XACT_PREDICATE_LIST,
LWTRANCHE_PGSTATS_DSA,
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 89f42bf9e3..a6ec135430 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
test_rls_hooks \
test_shm_mq \
test_slru \
+ test_tidstore \
unsafe_tests \
worker_spi
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index beaf4080fb..f126ea9f2e 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -31,5 +31,6 @@ subdir('test_regex')
subdir('test_rls_hooks')
subdir('test_shm_mq')
subdir('test_slru')
+subdir('test_tidstore')
subdir('unsafe_tests')
subdir('worker_spi')
diff --git a/src/test/modules/test_tidstore/Makefile b/src/test/modules/test_tidstore/Makefile
new file mode 100644
index 0000000000..dab107d70c
--- /dev/null
+++ b/src/test/modules/test_tidstore/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_tidstore/Makefile
+
+MODULE_big = test_tidstore
+OBJS = \
+ $(WIN32RES) \
+ test_tidstore.o
+PGFILEDESC = "test_tidstore - test code for src/backend/access/common/tidstore.c"
+
+EXTENSION = test_tidstore
+DATA = test_tidstore--1.0.sql
+
+REGRESS = test_tidstore
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_tidstore
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out
new file mode 100644
index 0000000000..7ff2f9af87
--- /dev/null
+++ b/src/test/modules/test_tidstore/expected/test_tidstore.out
@@ -0,0 +1,13 @@
+CREATE EXTENSION test_tidstore;
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
+NOTICE: testing empty tidstore
+NOTICE: testing basic operations
+ test_tidstore
+---------------
+
+(1 row)
+
diff --git a/src/test/modules/test_tidstore/meson.build b/src/test/modules/test_tidstore/meson.build
new file mode 100644
index 0000000000..31f2da7b61
--- /dev/null
+++ b/src/test/modules/test_tidstore/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_tidstore_sources = files(
+ 'test_tidstore.c',
+)
+
+if host_system == 'windows'
+ test_tidstore_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_tidstore',
+ '--FILEDESC', 'test_tidstore - test code for src/backend/access/common/tidstore.c',])
+endif
+
+test_tidstore = shared_module('test_tidstore',
+ test_tidstore_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_tidstore
+
+install_data(
+ 'test_tidstore.control',
+ 'test_tidstore--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_tidstore',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_tidstore',
+ ],
+ },
+}
diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql
new file mode 100644
index 0000000000..03aea31815
--- /dev/null
+++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_tidstore;
+
+--
+-- All the logic is in the test_tidstore() function. It will throw
+-- an error if something fails.
+--
+SELECT test_tidstore();
diff --git a/src/test/modules/test_tidstore/test_tidstore--1.0.sql b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
new file mode 100644
index 0000000000..47e9149900
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_tidstore/test_tidstore--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_tidstore" to load this file. \quit
+
+CREATE FUNCTION test_tidstore()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_tidstore/test_tidstore.c b/src/test/modules/test_tidstore/test_tidstore.c
new file mode 100644
index 0000000000..9a1217f833
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.c
@@ -0,0 +1,226 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_tidstore.c
+ * Test TidStore data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_tidstore/test_tidstore.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "access/tidstore.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC;
+
+/* #define TEST_SHARED_TIDSTORE 1 */
+
+#define TEST_TIDSTORE_MAX_BYTES (2 * 1024 * 1024L) /* 2MB */
+
+PG_FUNCTION_INFO_V1(test_tidstore);
+
+static void
+check_tid(TidStore *ts, BlockNumber blkno, OffsetNumber off, bool expect)
+{
+ ItemPointerData tid;
+ bool found;
+
+ ItemPointerSet(&tid, blkno, off);
+
+ found = tidstore_lookup_tid(ts, &tid);
+
+ if (found != expect)
+ elog(ERROR, "lookup TID (%u, %u) returned %d, expected %d",
+ blkno, off, found, expect);
+}
+
+static void
+test_basic(int max_offset)
+{
+#define TEST_TIDSTORE_NUM_BLOCKS 5
+#define TEST_TIDSTORE_NUM_OFFSETS 5
+
+ TidStore *ts;
+ TidStoreIter *iter;
+ TidStoreIterResult *iter_result;
+ BlockNumber blks[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, MaxBlockNumber, MaxBlockNumber - 1, 1, MaxBlockNumber / 2,
+ };
+ BlockNumber blks_sorted[TEST_TIDSTORE_NUM_BLOCKS] = {
+ 0, 1, MaxBlockNumber / 2, MaxBlockNumber - 1, MaxBlockNumber
+ };
+ OffsetNumber offs[TEST_TIDSTORE_NUM_OFFSETS];
+ int blk_idx;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, dsa);
+#else
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, max_offset, NULL);
+#endif
+
+ /* prepare the offset array */
+ offs[0] = FirstOffsetNumber;
+ offs[1] = FirstOffsetNumber + 1;
+ offs[2] = max_offset / 2;
+ offs[3] = max_offset - 1;
+ offs[4] = max_offset;
+
+ /* add tids */
+ for (int i = 0; i < TEST_TIDSTORE_NUM_BLOCKS; i++)
+ tidstore_add_tids(ts, blks[i], offs, TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* lookup test */
+ for (OffsetNumber off = FirstOffsetNumber ; off < max_offset; off++)
+ {
+ bool expect = false;
+ for (int i = 0; i < TEST_TIDSTORE_NUM_OFFSETS; i++)
+ {
+ if (offs[i] == off)
+ {
+ expect = true;
+ break;
+ }
+ }
+
+ check_tid(ts, 0, off, expect);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, expect);
+ }
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != (TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS))
+ elog(ERROR, "tidstore_num_tids returned " UINT64_FORMAT ", expected %d",
+ tidstore_num_tids(ts),
+ TEST_TIDSTORE_NUM_BLOCKS * TEST_TIDSTORE_NUM_OFFSETS);
+
+ /* iteration test */
+ iter = tidstore_begin_iterate(ts);
+ blk_idx = 0;
+ while ((iter_result = tidstore_iterate_next(iter)) != NULL)
+ {
+ /* check the returned block number */
+ if (blks_sorted[blk_idx] != iter_result->blkno)
+ elog(ERROR, "tidstore_iterate_next returned block number %u, expected %u",
+ iter_result->blkno, blks_sorted[blk_idx]);
+
+ /* check the returned offset numbers */
+ if (TEST_TIDSTORE_NUM_OFFSETS != iter_result->num_offsets)
+ elog(ERROR, "tidstore_iterate_next returned %u offsets, expected %u",
+ iter_result->num_offsets, TEST_TIDSTORE_NUM_OFFSETS);
+
+ for (int i = 0; i < iter_result->num_offsets; i++)
+ {
+ if (offs[i] != iter_result->offsets[i])
+ elog(ERROR, "tidstore_iterate_next returned offset number %u on block %u, expected %u",
+ iter_result->offsets[i], iter_result->blkno, offs[i]);
+ }
+
+ blk_idx++;
+ }
+
+ if (blk_idx != TEST_TIDSTORE_NUM_BLOCKS)
+ elog(ERROR, "tidstore_iterate_next returned %d blocks, expected %d",
+ blk_idx, TEST_TIDSTORE_NUM_BLOCKS);
+
+ /* remove all tids */
+ tidstore_reset(ts);
+
+ /* test the number of tids */
+ if (tidstore_num_tids(ts) != 0)
+ elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ /* lookup test for empty store */
+ for (OffsetNumber off = FirstOffsetNumber ; off < MaxHeapTuplesPerPage;
+ off++)
+ {
+ check_tid(ts, 0, off, false);
+ check_tid(ts, 2, off, false);
+ check_tid(ts, MaxBlockNumber - 2, off, false);
+ check_tid(ts, MaxBlockNumber, off, false);
+ }
+
+ tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_empty(void)
+{
+ TidStore *ts;
+ TidStoreIter *iter;
+ ItemPointerData tid;
+
+#ifdef TEST_SHARED_TIDSTORE
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_tidstore");
+ dsa = dsa_create(tranche_id);
+
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, dsa);
+#else
+ ts = tidstore_create(TEST_TIDSTORE_MAX_BYTES, MaxHeapTuplesPerPage, NULL);
+#endif
+
+ elog(NOTICE, "testing empty tidstore");
+
+ ItemPointerSet(&tid, 0, FirstOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (0,1) on empty store returned true");
+
+ ItemPointerSet(&tid, MaxBlockNumber, MaxOffsetNumber);
+ if (tidstore_lookup_tid(ts, &tid))
+ elog(ERROR, "tidstore_lookup_tid for (%u,%u) on empty store returned true",
+ MaxBlockNumber, MaxOffsetNumber);
+
+ if (tidstore_num_tids(ts) != 0)
+		elog(ERROR, "tidstore_num_tids on empty store returned non-zero");
+
+ if (tidstore_is_full(ts))
+ elog(ERROR, "tidstore_is_full on empty store returned true");
+
+ iter = tidstore_begin_iterate(ts);
+
+ if (tidstore_iterate_next(iter) != NULL)
+ elog(ERROR, "tidstore_iterate_next on empty store returned TIDs");
+
+ tidstore_end_iterate(iter);
+
+ tidstore_destroy(ts);
+
+#ifdef TEST_SHARED_TIDSTORE
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_tidstore(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ elog(NOTICE, "testing basic operations");
+ test_basic(MaxHeapTuplesPerPage);
+ test_basic(10);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_tidstore/test_tidstore.control b/src/test/modules/test_tidstore/test_tidstore.control
new file mode 100644
index 0000000000..9b6bd4638f
--- /dev/null
+++ b/src/test/modules/test_tidstore/test_tidstore.control
@@ -0,0 +1,4 @@
+comment = 'Test code for tidstore'
+default_version = '1.0'
+module_pathname = '$libdir/test_tidstore'
+relocatable = true
--
2.31.1
v32-0002-Move-some-bitmap-logic-out-of-bitmapset.c.patch
From 7fe0c744e052286a8c44716494fe4d644b0e8451 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Tue, 6 Dec 2022 13:39:41 +0700
Subject: [PATCH v32 02/18] Move some bitmap logic out of bitmapset.c
Add pg_rightmost_one32/64 functions. This functionality was previously
private to bitmapset.c as the RIGHTMOST_ONE macro. It has practical
use in other contexts, so move to pg_bitutils.h.
Also move the logic for selecting appropriate pg_bitutils functions
based on word size to bitmapset.h for wider visibility and add
appropriate selection for bmw_rightmost_one().
Since the previous macro relied on casting to signedbitmapword,
and the new functions do not, remove that typedef.
Design input and review by Tom Lane
Discussion: https://www.postgresql.org/message-id/CAFBsxsFW2JjTo58jtDB%2B3sZhxMx3t-3evew8%3DAcr%2BGGhC%2BkFaA%40mail.gmail.com
---
src/backend/nodes/bitmapset.c | 34 +-------------------------------
src/include/nodes/bitmapset.h | 16 +++++++++++++--
src/include/port/pg_bitutils.h | 31 +++++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 1 -
4 files changed, 46 insertions(+), 36 deletions(-)
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 7ba3cf635b..0b2962ed73 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -30,39 +30,7 @@
#define BITMAPSET_SIZE(nwords) \
(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word. It assumes two's complement arithmetic. Consider any
- * nonzero value, and focus attention on the rightmost one. The value is
- * then something like
- * xxxxxx10000
- * where x's are unspecified bits. The two's complement negative is formed
- * by inverting all the bits and adding one. Inversion gives
- * yyyyyy01111
- * where each y is the inverse of the corresponding x. Incrementing gives
- * yyyyyy10000
- * and then ANDing with the original value gives
- * 00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
-#define bmw_popcount(w) pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
-#define bmw_popcount(w) pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x))
static bool bms_is_empty_internal(const Bitmapset *a);
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 14de6a9ff1..c7e1711147 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -36,13 +36,11 @@ struct List;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword; /* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
#else
#define BITS_PER_BITMAPWORD 32
typedef uint32 bitmapword; /* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
#endif
@@ -73,6 +71,20 @@ typedef enum
BMS_MULTIPLE /* >1 member */
} BMS_Membership;
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w) pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w)
+#define bmw_popcount(w) pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w) pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w)
+#define bmw_popcount(w) pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
/*
* function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 158ef73a2b..bf7588e075 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -32,6 +32,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word. It assumes two's complement arithmetic. Consider any
+ * nonzero value, and focus attention on the rightmost one. The value is
+ * then something like
+ * xxxxxx10000
+ * where x's are unspecified bits. The two's complement negative is formed
+ * by inverting all the bits and adding one. Inversion gives
+ * yyyyyy01111
+ * where each y is the inverse of the corresponding x. Incrementing gives
+ * yyyyyy10000
+ * and then ANDing with the original value gives
+ * 00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+static inline uint32
+pg_rightmost_one32(uint32 word)
+{
+ int32 result = (int32) word & -((int32) word);
+ return (uint32) result;
+}
+
+static inline uint64
+pg_rightmost_one64(uint64 word)
+{
+ int64 result = (int64) word & -((int64) word);
+ return (uint64) result;
+}
+
/*
* pg_leftmost_one_pos32
* Returns the position of the most significant set bit in "word",
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b4058b88c3..fd3d83c781 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3684,7 +3684,6 @@ shmem_request_hook_type
shmem_startup_hook_type
sig_atomic_t
sigjmp_buf
-signedbitmapword
sigset_t
size_t
slist_head
--
2.31.1
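For reference, a minimal standalone sketch of what the relocated helper
computes (this mirrors pg_rightmost_one32() from the patch above, assumes
only a C99 compiler, and is not part of the patch itself):

#include <assert.h>
#include <stdint.h>

/* Two's-complement trick: x & -x isolates the rightmost one-bit. */
static inline uint32_t
rightmost_one32(uint32_t w)
{
	return (uint32_t) ((int32_t) w & -((int32_t) w));
}

int
main(void)
{
	assert(rightmost_one32(0x28) == 0x08);	/* 101000 -> 001000 */
	assert(rightmost_one32(0x08) == 0x08);	/* already a single bit */

	/*
	 * HAS_MULTIPLE_ONES(x): more than one bit is set iff x differs from
	 * its own rightmost one-bit.
	 */
	assert(rightmost_one32(0x28) != 0x28);	/* multiple ones */
	assert(rightmost_one32(0x04) == 0x04);	/* single one */

	return 0;
}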
Attachment: v32-0001-Introduce-helper-SIMD-functions-for-small-byte-a.patch (application/octet-stream)
From 51fe658fcecefb2b8c0d826c7d7d6070eb9e878c Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 24 Oct 2022 14:07:09 +0900
Subject: [PATCH v32 01/18] Introduce helper SIMD functions for small byte
arrays
vector8_min - helper for emulating ">=" semantics
vector8_highbit_mask - used to turn the result of a vector
comparison into a bitmask
Masahiko Sawada
Reviewed by Nathan Bossart, additional adjustments by me
Discussion: https://www.postgresql.org/message-id/CAD21AoDap240WDDdUDE0JMpCmuMMnGajrKrkCRxM7zn9Xk3JRA%40mail.gmail.com
---
src/include/port/simd.h | 47 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)
diff --git a/src/include/port/simd.h b/src/include/port/simd.h
index 1fa6c3bc6c..dfae14e463 100644
--- a/src/include/port/simd.h
+++ b/src/include/port/simd.h
@@ -79,6 +79,7 @@ static inline bool vector8_has_le(const Vector8 v, const uint8 c);
static inline bool vector8_is_highbit_set(const Vector8 v);
#ifndef USE_NO_SIMD
static inline bool vector32_is_highbit_set(const Vector32 v);
+static inline uint32 vector8_highbit_mask(const Vector8 v);
#endif
/* arithmetic operations */
@@ -96,6 +97,7 @@ static inline Vector8 vector8_ssub(const Vector8 v1, const Vector8 v2);
*/
#ifndef USE_NO_SIMD
static inline Vector8 vector8_eq(const Vector8 v1, const Vector8 v2);
+static inline Vector8 vector8_min(const Vector8 v1, const Vector8 v2);
static inline Vector32 vector32_eq(const Vector32 v1, const Vector32 v2);
#endif
@@ -299,6 +301,36 @@ vector32_is_highbit_set(const Vector32 v)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Return a bitmask formed from the high-bit of each element.
+ */
+#ifndef USE_NO_SIMD
+static inline uint32
+vector8_highbit_mask(const Vector8 v)
+{
+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ /*
+	 * Note: There is a faster way to do this, but it returns a uint64, and
+	 * if the caller wanted to extract the bit position using CTZ,
+ * it would have to divide that result by 4.
+ */
+ static const uint8 mask[16] = {
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ 1 << 0, 1 << 1, 1 << 2, 1 << 3,
+ 1 << 4, 1 << 5, 1 << 6, 1 << 7,
+ };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
/*
* Return the bitwise OR of the inputs
*/
@@ -372,4 +404,19 @@ vector32_eq(const Vector32 v1, const Vector32 v2)
}
#endif /* ! USE_NO_SIMD */
+/*
+ * Given two vectors, return a vector with the minimum element of each.
+ */
+#ifndef USE_NO_SIMD
+static inline Vector8
+vector8_min(const Vector8 v1, const Vector8 v2)
+{
+#ifdef USE_SSE2
+ return _mm_min_epu8(v1, v2);
+#elif defined(USE_NEON)
+ return vminq_u8(v1, v2);
+#endif
+}
+#endif /* ! USE_NO_SIMD */
+
#endif /* SIMD_H */
--
2.31.1
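To see what these helpers are for, here is a rough standalone sketch of the
search pattern that the radix tree patch below uses for its node-32, written
directly against the SSE2 intrinsics that vector8_broadcast()/vector8_eq()/
vector8_highbit_mask() wrap on x86 (illustration only; it assumes GCC/Clang
builtins and SSE2, and is not part of the patch):

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Return the index of the first byte in a 16-byte array equal to 'key',
 * or -1 if there is none.
 */
static int
find_byte16(const uint8_t *chunks, uint8_t key)
{
	__m128i		needle = _mm_set1_epi8((char) key);			/* broadcast the key */
	__m128i		haystack = _mm_loadu_si128((const __m128i *) chunks);
	__m128i		cmp = _mm_cmpeq_epi8(needle, haystack);		/* 0xFF where equal */
	uint32_t	mask = (uint32_t) _mm_movemask_epi8(cmp);	/* one bit per byte */

	return mask ? __builtin_ctz(mask) : -1;
}

int
main(void)
{
	uint8_t		chunks[16] = {1, 4, 9, 23, 42};

	printf("%d\n", find_byte16(chunks, 42));	/* prints 4 */
	printf("%d\n", find_byte16(chunks, 7));		/* prints -1 */
	return 0;
}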
Attachment: v32-0003-Add-radixtree-template.patch (application/octet-stream)
From b88b152cac7c31b49416c4e59e93b3b5f0813759 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Wed, 14 Sep 2022 12:38:51 +0000
Subject: [PATCH v32 03/18] Add radixtree template
WIP: commit message based on template comments
---
src/backend/utils/mmgr/dsa.c | 12 +
src/include/lib/radixtree.h | 2516 +++++++++++++++++
src/include/lib/radixtree_delete_impl.h | 122 +
src/include/lib/radixtree_insert_impl.h | 328 +++
src/include/lib/radixtree_iter_impl.h | 153 +
src/include/lib/radixtree_search_impl.h | 138 +
src/include/utils/dsa.h | 1 +
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
src/test/modules/test_radixtree/.gitignore | 4 +
src/test/modules/test_radixtree/Makefile | 23 +
src/test/modules/test_radixtree/README | 7 +
.../expected/test_radixtree.out | 36 +
src/test/modules/test_radixtree/meson.build | 35 +
.../test_radixtree/sql/test_radixtree.sql | 7 +
.../test_radixtree/test_radixtree--1.0.sql | 8 +
.../modules/test_radixtree/test_radixtree.c | 681 +++++
.../test_radixtree/test_radixtree.control | 4 +
src/tools/pginclude/cpluspluscheck | 6 +
src/tools/pginclude/headerscheck | 6 +
20 files changed, 4089 insertions(+)
create mode 100644 src/include/lib/radixtree.h
create mode 100644 src/include/lib/radixtree_delete_impl.h
create mode 100644 src/include/lib/radixtree_insert_impl.h
create mode 100644 src/include/lib/radixtree_iter_impl.h
create mode 100644 src/include/lib/radixtree_search_impl.h
create mode 100644 src/test/modules/test_radixtree/.gitignore
create mode 100644 src/test/modules/test_radixtree/Makefile
create mode 100644 src/test/modules/test_radixtree/README
create mode 100644 src/test/modules/test_radixtree/expected/test_radixtree.out
create mode 100644 src/test/modules/test_radixtree/meson.build
create mode 100644 src/test/modules/test_radixtree/sql/test_radixtree.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree--1.0.sql
create mode 100644 src/test/modules/test_radixtree/test_radixtree.c
create mode 100644 src/test/modules/test_radixtree/test_radixtree.control
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index f5a62061a3..80555aefff 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -1024,6 +1024,18 @@ dsa_set_size_limit(dsa_area *area, size_t limit)
LWLockRelease(DSA_AREA_LOCK(area));
}
+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));
+
+ return size;
+}
+
/*
* Aggressively free all spare memory in the hope of returning DSM segments to
* the operating system.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
new file mode 100644
index 0000000000..e546bd705c
--- /dev/null
+++ b/src/include/lib/radixtree.h
@@ -0,0 +1,2516 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree.h
+ * Template for adaptive radix tree.
+ *
+ * This module employs the idea from the paper "The Adaptive Radix Tree: ARTful
+ * Indexing for Main-Memory Databases" by Viktor Leis, Alfons Kemper, and Thomas
+ * Neumann, 2013. The radix tree uses adaptive node sizes, a small number of node
+ * types, each with a different numbers of elements. Depending on the number of
+ * children, the appropriate node type is used.
+ *
+ * WIP: notes about traditional radix tree trading off span vs height...
+ *
+ * There are two kinds of nodes, inner nodes and leaves. Inner nodes
+ * map partial keys to child pointers.
+ *
+ * The ART paper mentions three ways to implement leaves:
+ *
+ * "- Single-value leaves: The values are stored using an addi-
+ * tional leaf node type which stores one value.
+ * - Multi-value leaves: The values are stored in one of four
+ * different leaf node types, which mirror the structure of
+ * inner nodes, but contain values instead of pointers.
+ * - Combined pointer/value slots: If values fit into point-
+ * ers, no separate node types are necessary. Instead, each
+ * pointer storage location in an inner node can either
+ * store a pointer or a value."
+ *
+ * We chose "multi-value leaves" to avoid the additional pointer traversal
+ * required by "single-value leaves".
+ *
+ * For simplicity, the key is assumed to be a 64-bit unsigned integer. The
+ * tree doesn't need to contain paths where the highest bytes of all keys
+ * are zero. That way, the tree's height adapts to the distribution of keys.
+ *
+ * TODO: In the future it might be worthwhile to offer configurability of
+ * leaf implementation for different use cases. Single-value leaves would
+ * give more flexibility in key type, including variable-length keys.
+ *
+ * There are some optimizations not yet implemented, particularly path
+ * compression and lazy path expansion.
+ *
+ * To handle concurrency, we use a single reader-writer lock for the radix
+ * tree. The radix tree is exclusively locked during write operations such
+ * as RT_SET() and RT_DELETE(), and shared locked during read operations
+ * such as RT_SEARCH(). An iteration also holds the shared lock on the radix
+ * tree until it is completed.
+ *
+ * TODO: The current locking mechanism is not optimized for high concurrency
+ * with mixed read-write workloads. In the future it might be worthwhile
+ * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
+ * the paper "The ART of Practical Synchronization" by the same authors as
+ * the ART paper, 2016.
+ *
+ * WIP: the radix tree nodes don't shrink.
+ *
+ * To generate a radix tree and associated functions for a use case several
+ * macros have to be #define'ed before this file is included. Including
+ * the file #undef's all those, so a new radix tree can be generated
+ * afterwards.
+ * The relevant parameters are:
+ * - RT_PREFIX - prefix for all symbol names generated. A prefix of 'foo'
+ * will result in radix tree type 'foo_radix_tree' and functions like
+ * 'foo_create'/'foo_free' and so forth.
+ * - RT_DECLARE - if defined function prototypes and type declarations are
+ * generated
+ * - RT_DEFINE - if defined function definitions are generated
+ * - RT_SCOPE - in which scope (e.g. extern, static inline) do function
+ * declarations reside
+ * - RT_VALUE_TYPE - the type of the value.
+ *
+ * Optional parameters:
+ * - RT_SHMEM - if defined, the radix tree is created in the DSA area
+ * so that multiple processes can access it simultaneously.
+ * - RT_DEBUG - if defined add stats tracking and debugging functions
+ *
+ * Interface
+ * ---------
+ *
+ * RT_CREATE - Create a new, empty radix tree
+ * RT_FREE - Free the radix tree
+ * RT_SEARCH - Search a key-value pair
+ * RT_SET - Set a key-value pair
+ * RT_BEGIN_ITERATE - Begin iterating through all key-value pairs
+ * RT_ITERATE_NEXT - Return next key-value pair, if any
+ * RT_END_ITERATE - End iteration
+ * RT_MEMORY_USAGE - Get the memory usage
+ *
+ * Interface for Shared Memory
+ * ---------
+ *
+ * RT_ATTACH - Attach to the radix tree
+ * RT_DETACH - Detach from the radix tree
+ * RT_GET_HANDLE - Return the handle of the radix tree
+ *
+ * Optional Interface
+ * ---------
+ *
+ * RT_DELETE - Delete a key-value pair. Declared/defined only if RT_USE_DELETE is defined
+ *
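+ * As a rough usage sketch (the 'foo' prefix and the BlockNumber value type
+ * are only placeholder choices), a local, non-shared tree could be
+ * instantiated and used like this:
+ *
+ *		#define RT_PREFIX foo
+ *		#define RT_SCOPE static
+ *		#define RT_DECLARE
+ *		#define RT_DEFINE
+ *		#define RT_VALUE_TYPE BlockNumber
+ *		#include "lib/radixtree.h"
+ *
+ *		foo_radix_tree *tree = foo_create(CurrentMemoryContext);
+ *		uint64		key = 10;
+ *		BlockNumber value = 42;
+ *		bool		found;
+ *
+ *		foo_set(tree, key, &value);
+ *		found = foo_search(tree, key, &value);
+ *		foo_free(tree);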
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/include/lib/radixtree.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "port/pg_bitutils.h"
+#include "port/simd.h"
+#include "utils/dsa.h"
+#include "utils/memutils.h"
+
+/* helpers */
+#define RT_MAKE_PREFIX(a) CppConcat(a,_)
+#define RT_MAKE_NAME(name) RT_MAKE_NAME_(RT_MAKE_PREFIX(RT_PREFIX),name)
+#define RT_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define RT_CREATE RT_MAKE_NAME(create)
+#define RT_FREE RT_MAKE_NAME(free)
+#define RT_SEARCH RT_MAKE_NAME(search)
+#ifdef RT_SHMEM
+#define RT_ATTACH RT_MAKE_NAME(attach)
+#define RT_DETACH RT_MAKE_NAME(detach)
+#define RT_GET_HANDLE RT_MAKE_NAME(get_handle)
+#endif
+#define RT_SET RT_MAKE_NAME(set)
+#define RT_BEGIN_ITERATE RT_MAKE_NAME(begin_iterate)
+#define RT_ITERATE_NEXT RT_MAKE_NAME(iterate_next)
+#define RT_END_ITERATE RT_MAKE_NAME(end_iterate)
+#ifdef RT_USE_DELETE
+#define RT_DELETE RT_MAKE_NAME(delete)
+#endif
+#define RT_MEMORY_USAGE RT_MAKE_NAME(memory_usage)
+#ifdef RT_DEBUG
+#define RT_DUMP RT_MAKE_NAME(dump)
+#define RT_DUMP_NODE RT_MAKE_NAME(dump_node)
+#define RT_DUMP_SEARCH RT_MAKE_NAME(dump_search)
+#define RT_STATS RT_MAKE_NAME(stats)
+#endif
+
+/* internal helper functions (no externally visible prototypes) */
+#define RT_NEW_ROOT RT_MAKE_NAME(new_root)
+#define RT_ALLOC_NODE RT_MAKE_NAME(alloc_node)
+#define RT_INIT_NODE RT_MAKE_NAME(init_node)
+#define RT_FREE_NODE RT_MAKE_NAME(free_node)
+#define RT_FREE_RECURSE RT_MAKE_NAME(free_recurse)
+#define RT_EXTEND RT_MAKE_NAME(extend)
+#define RT_SET_EXTEND RT_MAKE_NAME(set_extend)
+#define RT_SWITCH_NODE_KIND RT_MAKE_NAME(grow_node_kind)
+#define RT_COPY_NODE RT_MAKE_NAME(copy_node)
+#define RT_REPLACE_NODE RT_MAKE_NAME(replace_node)
+#define RT_PTR_GET_LOCAL RT_MAKE_NAME(ptr_get_local)
+#define RT_PTR_ALLOC_IS_VALID RT_MAKE_NAME(ptr_stored_is_valid)
+#define RT_NODE_3_SEARCH_EQ RT_MAKE_NAME(node_3_search_eq)
+#define RT_NODE_32_SEARCH_EQ RT_MAKE_NAME(node_32_search_eq)
+#define RT_NODE_3_GET_INSERTPOS RT_MAKE_NAME(node_3_get_insertpos)
+#define RT_NODE_32_GET_INSERTPOS RT_MAKE_NAME(node_32_get_insertpos)
+#define RT_CHUNK_CHILDREN_ARRAY_SHIFT RT_MAKE_NAME(chunk_children_array_shift)
+#define RT_CHUNK_VALUES_ARRAY_SHIFT RT_MAKE_NAME(chunk_values_array_shift)
+#define RT_CHUNK_CHILDREN_ARRAY_DELETE RT_MAKE_NAME(chunk_children_array_delete)
+#define RT_CHUNK_VALUES_ARRAY_DELETE RT_MAKE_NAME(chunk_values_array_delete)
+#define RT_CHUNK_CHILDREN_ARRAY_COPY RT_MAKE_NAME(chunk_children_array_copy)
+#define RT_CHUNK_VALUES_ARRAY_COPY RT_MAKE_NAME(chunk_values_array_copy)
+#define RT_NODE_125_IS_CHUNK_USED RT_MAKE_NAME(node_125_is_chunk_used)
+#define RT_NODE_INNER_125_GET_CHILD RT_MAKE_NAME(node_inner_125_get_child)
+#define RT_NODE_LEAF_125_GET_VALUE RT_MAKE_NAME(node_leaf_125_get_value)
+#define RT_NODE_INNER_256_IS_CHUNK_USED RT_MAKE_NAME(node_inner_256_is_chunk_used)
+#define RT_NODE_LEAF_256_IS_CHUNK_USED RT_MAKE_NAME(node_leaf_256_is_chunk_used)
+#define RT_NODE_INNER_256_GET_CHILD RT_MAKE_NAME(node_inner_256_get_child)
+#define RT_NODE_LEAF_256_GET_VALUE RT_MAKE_NAME(node_leaf_256_get_value)
+#define RT_NODE_INNER_256_SET RT_MAKE_NAME(node_inner_256_set)
+#define RT_NODE_LEAF_256_SET RT_MAKE_NAME(node_leaf_256_set)
+#define RT_NODE_INNER_256_DELETE RT_MAKE_NAME(node_inner_256_delete)
+#define RT_NODE_LEAF_256_DELETE RT_MAKE_NAME(node_leaf_256_delete)
+#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
+#define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
+#define RT_NODE_SEARCH_INNER RT_MAKE_NAME(node_search_inner)
+#define RT_NODE_SEARCH_LEAF RT_MAKE_NAME(node_search_leaf)
+#define RT_NODE_UPDATE_INNER RT_MAKE_NAME(node_update_inner)
+#define RT_NODE_DELETE_INNER RT_MAKE_NAME(node_delete_inner)
+#define RT_NODE_DELETE_LEAF RT_MAKE_NAME(node_delete_leaf)
+#define RT_NODE_INSERT_INNER RT_MAKE_NAME(node_insert_inner)
+#define RT_NODE_INSERT_LEAF RT_MAKE_NAME(node_insert_leaf)
+#define RT_NODE_INNER_ITERATE_NEXT RT_MAKE_NAME(node_inner_iterate_next)
+#define RT_NODE_LEAF_ITERATE_NEXT RT_MAKE_NAME(node_leaf_iterate_next)
+#define RT_UPDATE_ITER_STACK RT_MAKE_NAME(update_iter_stack)
+#define RT_ITER_UPDATE_KEY RT_MAKE_NAME(iter_update_key)
+#define RT_VERIFY_NODE RT_MAKE_NAME(verify_node)
+
+/* type declarations */
+#define RT_RADIX_TREE RT_MAKE_NAME(radix_tree)
+#define RT_RADIX_TREE_CONTROL RT_MAKE_NAME(radix_tree_control)
+#define RT_ITER RT_MAKE_NAME(iter)
+#ifdef RT_SHMEM
+#define RT_HANDLE RT_MAKE_NAME(handle)
+#endif
+#define RT_NODE RT_MAKE_NAME(node)
+#define RT_NODE_ITER RT_MAKE_NAME(node_iter)
+#define RT_NODE_BASE_3 RT_MAKE_NAME(node_base_3)
+#define RT_NODE_BASE_32 RT_MAKE_NAME(node_base_32)
+#define RT_NODE_BASE_125 RT_MAKE_NAME(node_base_125)
+#define RT_NODE_BASE_256 RT_MAKE_NAME(node_base_256)
+#define RT_NODE_INNER_3 RT_MAKE_NAME(node_inner_3)
+#define RT_NODE_INNER_32 RT_MAKE_NAME(node_inner_32)
+#define RT_NODE_INNER_125 RT_MAKE_NAME(node_inner_125)
+#define RT_NODE_INNER_256 RT_MAKE_NAME(node_inner_256)
+#define RT_NODE_LEAF_3 RT_MAKE_NAME(node_leaf_3)
+#define RT_NODE_LEAF_32 RT_MAKE_NAME(node_leaf_32)
+#define RT_NODE_LEAF_125 RT_MAKE_NAME(node_leaf_125)
+#define RT_NODE_LEAF_256 RT_MAKE_NAME(node_leaf_256)
+#define RT_SIZE_CLASS RT_MAKE_NAME(size_class)
+#define RT_SIZE_CLASS_ELEM RT_MAKE_NAME(size_class_elem)
+#define RT_SIZE_CLASS_INFO RT_MAKE_NAME(size_class_info)
+#define RT_CLASS_3 RT_MAKE_NAME(class_3)
+#define RT_CLASS_32_MIN RT_MAKE_NAME(class_32_min)
+#define RT_CLASS_32_MAX RT_MAKE_NAME(class_32_max)
+#define RT_CLASS_125 RT_MAKE_NAME(class_125)
+#define RT_CLASS_256 RT_MAKE_NAME(class_256)
+
+/* generate forward declarations necessary to use the radix tree */
+#ifdef RT_DECLARE
+
+typedef struct RT_RADIX_TREE RT_RADIX_TREE;
+typedef struct RT_ITER RT_ITER;
+
+#ifdef RT_SHMEM
+typedef dsa_pointer RT_HANDLE;
+#endif
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id);
+RT_SCOPE RT_RADIX_TREE * RT_ATTACH(dsa_area *dsa, dsa_pointer dp);
+RT_SCOPE void RT_DETACH(RT_RADIX_TREE *tree);
+RT_SCOPE RT_HANDLE RT_GET_HANDLE(RT_RADIX_TREE *tree);
+#else
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+#endif
+RT_SCOPE void RT_FREE(RT_RADIX_TREE *tree);
+
+RT_SCOPE bool RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+RT_SCOPE bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
+#ifdef RT_USE_DELETE
+RT_SCOPE bool RT_DELETE(RT_RADIX_TREE *tree, uint64 key);
+#endif
+
+RT_SCOPE RT_ITER * RT_BEGIN_ITERATE(RT_RADIX_TREE *tree);
+RT_SCOPE bool RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p);
+RT_SCOPE void RT_END_ITERATE(RT_ITER *iter);
+
+RT_SCOPE uint64 RT_MEMORY_USAGE(RT_RADIX_TREE *tree);
+
+#ifdef RT_DEBUG
+RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree);
+RT_SCOPE void RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key);
+RT_SCOPE void RT_STATS(RT_RADIX_TREE *tree);
+#endif
+
+#endif /* RT_DECLARE */
+
+
+/* generate implementation of the radix tree */
+#ifdef RT_DEFINE
+
+/* The number of bits encoded in one tree level */
+#define RT_NODE_SPAN BITS_PER_BYTE
+
+/* The maximum number of slots in a node */
+#define RT_NODE_MAX_SLOTS (1 << RT_NODE_SPAN)
+
+/* Mask for extracting a chunk from the key */
+#define RT_CHUNK_MASK ((1 << RT_NODE_SPAN) - 1)
+
+/* Maximum shift the radix tree uses */
+#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
+
+/* Maximum number of levels the radix tree uses */
+#define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_NODE_SPAN)
+
+/*
+ * Number of bits necessary for isset array in the slot-index node.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ */
+#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 2)
+
+/* Invalid index used in node-125 */
+#define RT_INVALID_SLOT_IDX 0xFF
+
+/* Get a chunk from the key */
+#define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
+
+/* For accessing bitmaps */
+#define RT_BM_IDX(x) ((x) / BITS_PER_BITMAPWORD)
+#define RT_BM_BIT(x) ((x) % BITS_PER_BITMAPWORD)
+
+/*
+ * Node kinds
+ *
+ * The different node kinds are what make the tree "adaptive".
+ *
+ * Each node kind is associated with a different datatype and different
+ * search/set/delete/iterate algorithms adapted for its size. The largest
+ * kind, node256, is basically the same as a traditional radix tree,
+ * and would be most wasteful of memory when sparsely populated. The
+ * smaller nodes expend some additional CPU time to enable a smaller
+ * memory footprint.
+ *
+ * XXX There are 4 node kinds, and this should never be increased,
+ * for several reasons:
+ * 1. With 5 or more kinds, gcc tends to use a jump table for switch
+ * statements.
+ * 2. The 4 kinds can be represented with 2 bits, so we have the option
+ * in the future to tag the node pointer with the kind, even on
+ * platforms with 32-bit pointers. This might speed up node traversal
+ * in trees with highly random node kinds.
+ * 3. We can have multiple size classes per node kind.
+ */
+#define RT_NODE_KIND_3 0x00
+#define RT_NODE_KIND_32 0x01
+#define RT_NODE_KIND_125 0x02
+#define RT_NODE_KIND_256 0x03
+#define RT_NODE_KIND_COUNT 4
+
+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
+
+/* Common type for all nodes types */
+typedef struct RT_NODE
+{
+ /*
+	 * Number of children. We use uint16 to be able to indicate up to 256
+	 * children, the full fanout with an 8-bit span, which does not fit in uint8.
+ */
+ uint16 count;
+
+ /*
+ * Max capacity for the current size class. Storing this in the
+ * node enables multiple size classes per node kind.
+ * Technically, kinds with a single size class don't need this, so we could
+ * keep this in the individual base types, but the code is simpler this way.
+ * Note: node256 is unique in that it cannot possibly have more than a
+ * single size class, so for that kind we store zero, and uint8 is
+ * sufficient for other kinds.
+ */
+ uint8 fanout;
+
+ /*
+ * Shift indicates which part of the key space is represented by this
+ * node. That is, the key is shifted by 'shift' and the lowest
+ * RT_NODE_SPAN bits are then represented in chunk.
+ */
+ uint8 shift;
+
+ /* Node kind, one per search/set algorithm */
+ uint8 kind;
+} RT_NODE;
+
+
+#define RT_PTR_LOCAL RT_NODE *
+
+#ifdef RT_SHMEM
+#define RT_PTR_ALLOC dsa_pointer
+#else
+#define RT_PTR_ALLOC RT_PTR_LOCAL
+#endif
+
+
+#ifdef RT_SHMEM
+#define RT_INVALID_PTR_ALLOC InvalidDsaPointer
+#else
+#define RT_INVALID_PTR_ALLOC NULL
+#endif
+
+#ifdef RT_SHMEM
+#define RT_LOCK_EXCLUSIVE(tree) LWLockAcquire(&tree->ctl->lock, LW_EXCLUSIVE)
+#define RT_LOCK_SHARED(tree) LWLockAcquire(&tree->ctl->lock, LW_SHARED)
+#define RT_UNLOCK(tree) LWLockRelease(&tree->ctl->lock);
+#else
+#define RT_LOCK_EXCLUSIVE(tree) ((void) 0)
+#define RT_LOCK_SHARED(tree) ((void) 0)
+#define RT_UNLOCK(tree) ((void) 0)
+#endif
+
+/*
+ * Inner nodes and leaf nodes have analogous structure. To distinguish
+ * them at runtime, we take advantage of the fact that the key chunk
+ * is accessed by shifting: inner tree nodes (shift > 0) store pointers
+ * to their child nodes in the slots. In leaf nodes (shift == 0),
+ * the slot contains the value corresponding to the key.
+ */
+#define RT_NODE_IS_LEAF(n) (((RT_PTR_LOCAL) (n))->shift == 0)
+
+#define RT_NODE_MUST_GROW(node) \
+ ((node)->base.n.count == (node)->base.n.fanout)
+
+/*
+ * Base type of each node kinds for leaf and inner nodes.
+ * The base types must be a be able to accommodate the largest size
+ * class for variable-sized node kinds.
+ */
+typedef struct RT_NODE_BASE_3
+{
+ RT_NODE n;
+
+ /* 3 children, for key chunks */
+ uint8 chunks[3];
+} RT_NODE_BASE_3;
+
+typedef struct RT_NODE_BASE_32
+{
+ RT_NODE n;
+
+ /* 32 children, for key chunks */
+ uint8 chunks[32];
+} RT_NODE_BASE_32;
+
+/*
+ * node-125 uses the slot_idxs array, an array of length RT_NODE_MAX_SLOTS,
+ * to store indexes into a second array that contains the values (or
+ * child pointers).
+ */
+typedef struct RT_NODE_BASE_125
+{
+ RT_NODE n;
+
+	/* The index of the slot for each key chunk */
+ uint8 slot_idxs[RT_NODE_MAX_SLOTS];
+
+ /* bitmap to track which slots are in use */
+ bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+} RT_NODE_BASE_125;
+
+typedef struct RT_NODE_BASE_256
+{
+ RT_NODE n;
+} RT_NODE_BASE_256;
+
+/*
+ * Inner and leaf nodes.
+ *
+ * These are separate because the value type might be different than
+ * something fitting into a pointer-width type.
+ */
+typedef struct RT_NODE_INNER_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_3;
+
+typedef struct RT_NODE_LEAF_3
+{
+ RT_NODE_BASE_3 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_3;
+
+typedef struct RT_NODE_INNER_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_32;
+
+typedef struct RT_NODE_LEAF_32
+{
+ RT_NODE_BASE_32 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_32;
+
+typedef struct RT_NODE_INNER_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of children depends on size class */
+ RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_INNER_125;
+
+typedef struct RT_NODE_LEAF_125
+{
+ RT_NODE_BASE_125 base;
+
+ /* number of values depends on size class */
+ RT_VALUE_TYPE values[FLEXIBLE_ARRAY_MEMBER];
+} RT_NODE_LEAF_125;
+
+/*
+ * node-256 is the largest node type. This node has an array
+ * for directly storing values (or child pointers in inner nodes).
+ * Unlike other node kinds, its array size is by definition
+ * fixed.
+ */
+typedef struct RT_NODE_INNER_256
+{
+ RT_NODE_BASE_256 base;
+
+ /* Slots for 256 children */
+ RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+} RT_NODE_INNER_256;
+
+typedef struct RT_NODE_LEAF_256
+{
+ RT_NODE_BASE_256 base;
+
+ /*
+ * Unlike with inner256, zero is a valid value here, so we use a
+ * bitmap to track which slots are in use.
+ */
+ bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+
+ /* Slots for 256 values */
+ RT_VALUE_TYPE values[RT_NODE_MAX_SLOTS];
+} RT_NODE_LEAF_256;
+
+/*
+ * Node size classes
+ *
+ * Nodes of different kinds necessarily belong to different size classes.
+ * The main innovation in our implementation compared to the ART paper
+ * is decoupling the notion of size class from kind.
+ *
+ * The size classes within a given node kind have the same underlying
+ * type, but a variable number of children/values. This is possible
+ * because the base type contains small fixed data structures that
+ * work the same way regardless of how full the node is. We store the
+ * node's allocated capacity in the "fanout" member of RT_NODE, to allow
+ * runtime introspection.
+ *
+ * Growing from one node kind to another requires special code for each
+ * case, but growing from one size class to another within the same kind
+ * is basically just allocate + memcpy.
+ *
+ * The size classes have been chosen so that inner nodes on platforms
+ * with 64-bit pointers (and leaf nodes when using a 64-bit key) are
+ * equal to or slightly smaller than some DSA size class.
+ */
+typedef enum RT_SIZE_CLASS
+{
+ RT_CLASS_3 = 0,
+ RT_CLASS_32_MIN,
+ RT_CLASS_32_MAX,
+ RT_CLASS_125,
+ RT_CLASS_256
+} RT_SIZE_CLASS;
+
+/* Information for each size class */
+typedef struct RT_SIZE_CLASS_ELEM
+{
+ const char *name;
+ int fanout;
+
+ /* slab chunk size */
+ Size inner_size;
+ Size leaf_size;
+} RT_SIZE_CLASS_ELEM;
+
+static const RT_SIZE_CLASS_ELEM RT_SIZE_CLASS_INFO[] = {
+ [RT_CLASS_3] = {
+ .name = "radix tree node 3",
+ .fanout = 3,
+ .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_3) + 3 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MIN] = {
+ .name = "radix tree node 15",
+ .fanout = 15,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 15 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 15 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_32_MAX] = {
+ .name = "radix tree node 32",
+ .fanout = 32,
+ .inner_size = sizeof(RT_NODE_INNER_32) + 32 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_32) + 32 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_125] = {
+ .name = "radix tree node 125",
+ .fanout = 125,
+ .inner_size = sizeof(RT_NODE_INNER_125) + 125 * sizeof(RT_PTR_ALLOC),
+ .leaf_size = sizeof(RT_NODE_LEAF_125) + 125 * sizeof(RT_VALUE_TYPE),
+ },
+ [RT_CLASS_256] = {
+ .name = "radix tree node 256",
+ .fanout = 256,
+ .inner_size = sizeof(RT_NODE_INNER_256),
+ .leaf_size = sizeof(RT_NODE_LEAF_256),
+ },
+};
+
+#define RT_SIZE_CLASS_COUNT lengthof(RT_SIZE_CLASS_INFO)
+
+#ifdef RT_SHMEM
+/* A magic value used to identify our radix tree */
+#define RT_RADIX_TREE_MAGIC 0x54A48167
+#endif
+
+/* Contains the actual tree and ancillary info */
+// WIP: this name is a bit strange
+typedef struct RT_RADIX_TREE_CONTROL
+{
+#ifdef RT_SHMEM
+ RT_HANDLE handle;
+ uint32 magic;
+ LWLock lock;
+#endif
+
+ RT_PTR_ALLOC root;
+ uint64 max_val;
+ uint64 num_keys;
+
+ /* statistics */
+#ifdef RT_DEBUG
+ int32 cnt[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE_CONTROL;
+
+/* Entry point for allocating and accessing the tree */
+typedef struct RT_RADIX_TREE
+{
+ MemoryContext context;
+
+ /* pointing to either local memory or DSA */
+ RT_RADIX_TREE_CONTROL *ctl;
+
+#ifdef RT_SHMEM
+ dsa_area *dsa;
+#else
+ MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
+ MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];
+#endif
+} RT_RADIX_TREE;
+
+/*
+ * Iteration support.
+ *
+ * Iterating the radix tree returns each pair of key and value in the ascending
+ * order of the key. To support this, we iterate over the nodes at each level.
+ *
+ * RT_NODE_ITER struct is used to track the iteration within a node.
+ *
+ * RT_ITER is the struct for iteration of the radix tree, and uses RT_NODE_ITER
+ * in order to track the iteration of each level. During iteration, we also
+ * construct the key whenever updating the node iteration information, e.g., when
+ * advancing the current index within the node or when moving to the next node
+ * at the same level.
+ *
+ * XXX: Currently we allow only one process to do iteration. Therefore, rt_node_iter
+ * has the local pointers to nodes, rather than RT_PTR_ALLOC.
+ * We need either a safeguard to disallow other processes from beginning an iteration
+ * while one is in progress, or to allow multiple processes to iterate concurrently.
+ */
+typedef struct RT_NODE_ITER
+{
+ RT_PTR_LOCAL node; /* current node being iterated */
+ int current_idx; /* current position. -1 for initial value */
+} RT_NODE_ITER;
+
+typedef struct RT_ITER
+{
+ RT_RADIX_TREE *tree;
+
+ /* Track the iteration on nodes of each level */
+ RT_NODE_ITER stack[RT_MAX_LEVEL];
+ int stack_len;
+
+ /* The key is constructed during iteration */
+ uint64 key;
+} RT_ITER;
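+
+/*
+ * A rough caller-side sketch of the iteration interface ('tree' is an
+ * already-populated radix tree):
+ *
+ *		RT_VALUE_TYPE value;
+ *		uint64		key;
+ *		RT_ITER    *iter = RT_BEGIN_ITERATE(tree);
+ *
+ *		while (RT_ITERATE_NEXT(iter, &key, &value))
+ *			... key-value pairs are returned in ascending key order ...
+ *
+ *		RT_END_ITERATE(iter);
+ */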
+
+
+static void RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child);
+static bool RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p);
+
+/* verification (available only with assertion) */
+static void RT_VERIFY_NODE(RT_PTR_LOCAL node);
+
+/* Get the local address of an allocated node */
+static inline RT_PTR_LOCAL
+RT_PTR_GET_LOCAL(RT_RADIX_TREE *tree, RT_PTR_ALLOC node)
+{
+#ifdef RT_SHMEM
+ return dsa_get_address(tree->dsa, (dsa_pointer) node);
+#else
+ return node;
+#endif
+}
+
+static inline bool
+RT_PTR_ALLOC_IS_VALID(RT_PTR_ALLOC ptr)
+{
+#ifdef RT_SHMEM
+ return DsaPointerIsValid(ptr);
+#else
+ return PointerIsValid(ptr);
+#endif
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_3_SEARCH_EQ(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx = -1;
+
+ for (int i = 0; i < node->n.count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_3_GET_INSERTPOS(RT_NODE_BASE_3 *node, uint8 chunk)
+{
+ int idx;
+
+ for (idx = 0; idx < node->n.count; idx++)
+ {
+ if (node->chunks[idx] >= chunk)
+ break;
+ }
+
+ return idx;
+}
+
+/*
+ * Return index of the first element in the node's chunk array that equals
+ * 'chunk'. Return -1 if there is no such element.
+ */
+static inline int
+RT_NODE_32_SEARCH_EQ(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ uint32 bitfield;
+ int index_simd = -1;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index = -1;
+
+ for (int i = 0; i < count; i++)
+ {
+ if (node->chunks[i] == chunk)
+ {
+ index = i;
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /* replicate the search key */
+ spread_chunk = vector8_broadcast(chunk);
+
+ /* compare to all 32 keys stored in the node */
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ cmp1 = vector8_eq(spread_chunk, haystack1);
+ cmp2 = vector8_eq(spread_chunk, haystack2);
+
+ /* convert comparison to a bitfield */
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+
+ /* mask off invalid entries */
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ /* convert bitfield to index by counting trailing zeros */
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+/*
+ * Return index of the chunk and slot arrays for inserting into the node,
+ * such that the chunk array remains ordered.
+ */
+static inline int
+RT_NODE_32_GET_INSERTPOS(RT_NODE_BASE_32 *node, uint8 chunk)
+{
+ int count = node->n.count;
+#ifndef USE_NO_SIMD
+ Vector8 spread_chunk;
+ Vector8 haystack1;
+ Vector8 haystack2;
+ Vector8 cmp1;
+ Vector8 cmp2;
+ Vector8 min1;
+ Vector8 min2;
+ uint32 bitfield;
+ int index_simd;
+#endif
+
+#if defined(USE_NO_SIMD) || defined(USE_ASSERT_CHECKING)
+ int index;
+
+ for (index = 0; index < count; index++)
+ {
+ /*
+ * This is coded with '>=' to match what we can do with SIMD,
+ * with an assert to keep us honest.
+ */
+ if (node->chunks[index] >= chunk)
+ {
+ Assert(node->chunks[index] != chunk);
+ break;
+ }
+ }
+#endif
+
+#ifndef USE_NO_SIMD
+ /*
+ * This is a bit more complicated than RT_NODE_32_SEARCH_EQ(), because
+ * no unsigned uint8 comparison instruction exists, at least for SSE2. So
+ * we need to play some trickery using vector8_min() to effectively get
+ * >=. There'll never be any equal elements in current uses, but that's
+ * what we get here...
+ */
+ spread_chunk = vector8_broadcast(chunk);
+ vector8_load(&haystack1, &node->chunks[0]);
+ vector8_load(&haystack2, &node->chunks[sizeof(Vector8)]);
+ min1 = vector8_min(spread_chunk, haystack1);
+ min2 = vector8_min(spread_chunk, haystack2);
+ cmp1 = vector8_eq(spread_chunk, min1);
+ cmp2 = vector8_eq(spread_chunk, min2);
+ bitfield = vector8_highbit_mask(cmp1) | (vector8_highbit_mask(cmp2) << sizeof(Vector8));
+ bitfield &= ((UINT64CONST(1) << count) - 1);
+
+ if (bitfield)
+ index_simd = pg_rightmost_one_pos32(bitfield);
+ else
+ index_simd = count;
+
+ Assert(index_simd == index);
+ return index_simd;
+#else
+ return index;
+#endif
+}
+
+
+/*
+ * Functions to manipulate both chunks array and children/values array.
+ * These are used for node-3 and node-32.
+ */
+
+/* Shift the elements right at 'idx' by one */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_SHIFT(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(children[idx + 1]), &(children[idx]), sizeof(RT_PTR_ALLOC) * (count - idx));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_SHIFT(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx + 1]), &(chunks[idx]), sizeof(uint8) * (count - idx));
+ memmove(&(values[idx + 1]), &(values[idx]), sizeof(RT_VALUE_TYPE) * (count - idx));
+}
+
+/* Delete the element at 'idx' */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_DELETE(uint8 *chunks, RT_PTR_ALLOC *children, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(children[idx]), &(children[idx + 1]), sizeof(RT_PTR_ALLOC) * (count - idx - 1));
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_DELETE(uint8 *chunks, RT_VALUE_TYPE *values, int count, int idx)
+{
+ memmove(&(chunks[idx]), &(chunks[idx + 1]), sizeof(uint8) * (count - idx - 1));
+ memmove(&(values[idx]), &(values[idx + 1]), sizeof(RT_VALUE_TYPE) * (count - idx - 1));
+}
+
+/* Copy both chunks and children/values arrays */
+static inline void
+RT_CHUNK_CHILDREN_ARRAY_COPY(uint8 *src_chunks, RT_PTR_ALLOC *src_children,
+ uint8 *dst_chunks, RT_PTR_ALLOC *dst_children)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size children_size = sizeof(RT_PTR_ALLOC) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_children, src_children, children_size);
+}
+
+static inline void
+RT_CHUNK_VALUES_ARRAY_COPY(uint8 *src_chunks, RT_VALUE_TYPE *src_values,
+ uint8 *dst_chunks, RT_VALUE_TYPE *dst_values)
+{
+ const int fanout = RT_SIZE_CLASS_INFO[RT_CLASS_3].fanout;
+ const Size chunk_size = sizeof(uint8) * fanout;
+ const Size values_size = sizeof(RT_VALUE_TYPE) * fanout;
+
+ memcpy(dst_chunks, src_chunks, chunk_size);
+ memcpy(dst_values, src_values, values_size);
+}
+
+/* Functions to manipulate inner and leaf node-125 */
+
+/* Does the given chunk in the node have a value? */
+static inline bool
+RT_NODE_125_IS_CHUNK_USED(RT_NODE_BASE_125 *node, uint8 chunk)
+{
+ return node->slot_idxs[chunk] != RT_INVALID_SLOT_IDX;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_125_GET_CHILD(RT_NODE_INNER_125 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[node->base.slot_idxs[chunk]];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_125_GET_VALUE(RT_NODE_LEAF_125 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(((RT_NODE_BASE_125 *) node)->slot_idxs[chunk] != RT_INVALID_SLOT_IDX);
+ return node->values[node->base.slot_idxs[chunk]];
+}
+
+/* Functions to manipulate inner and leaf node-256 */
+
+/* Return true if the slot corresponding to the given chunk is in use */
+static inline bool
+RT_NODE_INNER_256_IS_CHUNK_USED(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ return node->children[chunk] != RT_INVALID_PTR_ALLOC;
+}
+
+static inline bool
+RT_NODE_LEAF_256_IS_CHUNK_USED(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ return (node->isset[idx] & ((bitmapword) 1 << bitnum)) != 0;
+}
+
+static inline RT_PTR_ALLOC
+RT_NODE_INNER_256_GET_CHILD(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_INNER_256_IS_CHUNK_USED(node, chunk));
+ return node->children[chunk];
+}
+
+static inline RT_VALUE_TYPE
+RT_NODE_LEAF_256_GET_VALUE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ Assert(RT_NODE_IS_LEAF(node));
+ Assert(RT_NODE_LEAF_256_IS_CHUNK_USED(node, chunk));
+ return node->values[chunk];
+}
+
+/* Set the child in the node-256 */
+static inline void
+RT_NODE_INNER_256_SET(RT_NODE_INNER_256 *node, uint8 chunk, RT_PTR_ALLOC child)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = child;
+}
+
+/* Set the value in the node-256 */
+static inline void
+RT_NODE_LEAF_256_SET(RT_NODE_LEAF_256 *node, uint8 chunk, RT_VALUE_TYPE value)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] |= ((bitmapword) 1 << bitnum);
+ node->values[chunk] = value;
+}
+
+/* Set the slot at the given chunk position */
+static inline void
+RT_NODE_INNER_256_DELETE(RT_NODE_INNER_256 *node, uint8 chunk)
+{
+ Assert(!RT_NODE_IS_LEAF(node));
+ node->children[chunk] = RT_INVALID_PTR_ALLOC;
+}
+
+static inline void
+RT_NODE_LEAF_256_DELETE(RT_NODE_LEAF_256 *node, uint8 chunk)
+{
+ int idx = RT_BM_IDX(chunk);
+ int bitnum = RT_BM_BIT(chunk);
+
+ Assert(RT_NODE_IS_LEAF(node));
+ node->isset[idx] &= ~((bitmapword) 1 << bitnum);
+}
+
+/*
+ * Return the largest shift that will allow storing the given key.
+ */
+static inline int
+RT_KEY_GET_SHIFT(uint64 key)
+{
+ if (key == 0)
+ return 0;
+ else
+ return (pg_leftmost_one_pos64(key) / RT_NODE_SPAN) * RT_NODE_SPAN;
+}
+
+/*
+ * Return the max value that can be stored in the tree with the given shift.
+ */
+static uint64
+RT_SHIFT_GET_MAX_VAL(int shift)
+{
+ if (shift == RT_MAX_SHIFT)
+ return UINT64_MAX;
+
+ return (UINT64CONST(1) << (shift + RT_NODE_SPAN)) - 1;
+}
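+
+/*
+ * Worked example for the two helpers above (with RT_NODE_SPAN == 8): for
+ * key == 0x123456, pg_leftmost_one_pos64() returns 20, so RT_KEY_GET_SHIFT()
+ * yields (20 / 8) * 8 == 16, i.e. three levels (shifts 16, 8, 0) suffice.
+ * RT_SHIFT_GET_MAX_VAL(16) is then (1 << 24) - 1 == 0xFFFFFF, the largest
+ * key such a tree can hold before RT_EXTEND() has to add another level.
+ */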
+
+/*
+ * Allocate a new node with the given size class.
+ */
+static RT_PTR_ALLOC
+RT_ALLOC_NODE(RT_RADIX_TREE *tree, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ RT_PTR_ALLOC allocnode;
+ size_t allocsize;
+
+ if (is_leaf)
+ allocsize = RT_SIZE_CLASS_INFO[size_class].leaf_size;
+ else
+ allocsize = RT_SIZE_CLASS_INFO[size_class].inner_size;
+
+#ifdef RT_SHMEM
+ allocnode = dsa_allocate(tree->dsa, allocsize);
+#else
+ if (is_leaf)
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_slabs[size_class],
+ allocsize);
+ else
+ allocnode = (RT_PTR_ALLOC) MemoryContextAlloc(tree->inner_slabs[size_class],
+ allocsize);
+#endif
+
+#ifdef RT_DEBUG
+ /* update the statistics */
+ tree->ctl->cnt[size_class]++;
+#endif
+
+ return allocnode;
+}
+
+/* Initialize the node contents */
+static inline void
+RT_INIT_NODE(RT_PTR_LOCAL node, uint8 kind, RT_SIZE_CLASS size_class, bool is_leaf)
+{
+ if (is_leaf)
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].leaf_size);
+ else
+ MemSet(node, 0, RT_SIZE_CLASS_INFO[size_class].inner_size);
+
+ node->kind = kind;
+
+ if (kind == RT_NODE_KIND_256)
+ /* See comment for the RT_NODE type */
+ Assert(node->fanout == 0);
+ else
+ node->fanout = RT_SIZE_CLASS_INFO[size_class].fanout;
+
+ /* Initialize slot_idxs to invalid values */
+ if (kind == RT_NODE_KIND_125)
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+
+ memset(n125->slot_idxs, RT_INVALID_SLOT_IDX, sizeof(n125->slot_idxs));
+ }
+}
+
+/*
+ * Create a new node as the root. Subordinate nodes will be created during
+ * the insertion.
+ */
+static pg_noinline void
+RT_NEW_ROOT(RT_RADIX_TREE *tree, uint64 key)
+{
+ int shift = RT_KEY_GET_SHIFT(key);
+ bool is_leaf = shift == 0;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+
+ allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newnode->shift = shift;
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(shift);
+ tree->ctl->root = allocnode;
+}
+
+static inline void
+RT_COPY_NODE(RT_PTR_LOCAL newnode, RT_PTR_LOCAL oldnode)
+{
+ newnode->shift = oldnode->shift;
+ newnode->count = oldnode->count;
+}
+
+/*
+ * Given a newly allocated node and an old node, initialize the new
+ * node with the necessary fields and return its local pointer.
+ */
+static inline RT_PTR_LOCAL
+RT_SWITCH_NODE_KIND(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, RT_PTR_LOCAL node,
+ uint8 new_kind, uint8 new_class, bool is_leaf)
+{
+ RT_PTR_LOCAL newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ RT_INIT_NODE(newnode, new_kind, new_class, is_leaf);
+ RT_COPY_NODE(newnode, node);
+
+ return newnode;
+}
+
+/* Free the given node */
+static void
+RT_FREE_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode)
+{
+ /* If we're deleting the root node, make the tree empty */
+ if (tree->ctl->root == allocnode)
+ {
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+ tree->ctl->max_val = 0;
+ }
+
+#ifdef RT_DEBUG
+ {
+ int i;
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ /* update the statistics */
+ for (i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ if (node->fanout == RT_SIZE_CLASS_INFO[i].fanout)
+ break;
+ }
+
+ /* fanout of node256 is intentionally 0 */
+ if (i == RT_SIZE_CLASS_COUNT)
+ i = RT_CLASS_256;
+
+ tree->ctl->cnt[i]--;
+ Assert(tree->ctl->cnt[i] >= 0);
+ }
+#endif
+
+#ifdef RT_SHMEM
+ dsa_free(tree->dsa, allocnode);
+#else
+ pfree(allocnode);
+#endif
+}
+
+/* Update the parent's pointer when growing a node */
+static inline void
+RT_NODE_UPDATE_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC new_child)
+{
+#define RT_ACTION_UPDATE
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+#undef RT_ACTION_UPDATE
+}
+
+/*
+ * Replace old_child with new_child, and free the old one.
+ */
+static inline void
+RT_REPLACE_NODE(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_old_child, RT_PTR_LOCAL old_child,
+ RT_PTR_ALLOC new_child, uint64 key)
+{
+#ifdef USE_ASSERT_CHECKING
+ RT_PTR_LOCAL new = RT_PTR_GET_LOCAL(tree, new_child);
+
+ Assert(old_child->shift == new->shift);
+ Assert(old_child->count == new->count);
+#endif
+
+ if (parent == old_child)
+ {
+ /* Replace the root node with the new larger node */
+ tree->ctl->root = new_child;
+ }
+ else
+ RT_NODE_UPDATE_INNER(parent, key, new_child);
+
+ RT_FREE_NODE(tree, stored_old_child);
+}
+
+/*
+ * The radix tree doesn't have sufficient height. Extend the radix tree so
+ * it can store the key.
+ */
+static pg_noinline void
+RT_EXTEND(RT_RADIX_TREE *tree, uint64 key)
+{
+ int target_shift;
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ int shift = root->shift + RT_NODE_SPAN;
+
+ target_shift = RT_KEY_GET_SHIFT(key);
+
+ /* Grow tree from 'shift' to 'target_shift' */
+ while (shift <= target_shift)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ RT_NODE_INNER_3 *n3;
+
+		allocnode = RT_ALLOC_NODE(tree, RT_CLASS_3, false);
+		node = RT_PTR_GET_LOCAL(tree, allocnode);
+		RT_INIT_NODE(node, RT_NODE_KIND_3, RT_CLASS_3, false);
+ node->shift = shift;
+ node->count = 1;
+
+ n3 = (RT_NODE_INNER_3 *) node;
+ n3->base.chunks[0] = 0;
+ n3->children[0] = tree->ctl->root;
+
+ /* Update the root */
+ tree->ctl->root = allocnode;
+
+ shift += RT_NODE_SPAN;
+ }
+
+ tree->ctl->max_val = RT_SHIFT_GET_MAX_VAL(target_shift);
+}
+
+/*
+ * The radix tree doesn't have inner and leaf nodes for the given key-value pair.
+ * Insert inner and leaf nodes from 'node' down to the bottom.
+ */
+static pg_noinline void
+RT_SET_EXTEND(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, RT_PTR_LOCAL parent,
+ RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node)
+{
+ int shift = node->shift;
+
+ Assert(RT_PTR_GET_LOCAL(tree, stored_node) == node);
+
+ while (shift >= RT_NODE_SPAN)
+ {
+ RT_PTR_ALLOC allocchild;
+ RT_PTR_LOCAL newchild;
+ int newshift = shift - RT_NODE_SPAN;
+ bool is_leaf = newshift == 0;
+
+ allocchild = RT_ALLOC_NODE(tree, RT_CLASS_3, is_leaf);
+ newchild = RT_PTR_GET_LOCAL(tree, allocchild);
+ RT_INIT_NODE(newchild, RT_NODE_KIND_3, RT_CLASS_3, is_leaf);
+ newchild->shift = newshift;
+ RT_NODE_INSERT_INNER(tree, parent, stored_node, node, key, allocchild);
+
+ parent = node;
+ node = newchild;
+ stored_node = allocchild;
+ shift -= RT_NODE_SPAN;
+ }
+
+ RT_NODE_INSERT_LEAF(tree, parent, stored_node, node, key, value_p);
+ tree->ctl->num_keys++;
+}
+
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the child
+ * pointer is returned in child_p.
+ */
+static inline bool
+RT_NODE_SEARCH_INNER(RT_PTR_LOCAL node, uint64 key, RT_PTR_ALLOC *child_p)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Return true if the key is found, otherwise return false. On success, the
+ * value is copied into value_p.
+ */
+static inline bool
+RT_NODE_SEARCH_LEAF(RT_PTR_LOCAL node, uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_search_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Search for the child pointer corresponding to 'key' in the given node.
+ *
+ * Delete the child entry and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_INNER(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Search for the value corresponding to 'key' in the given node.
+ *
+ * Delete the value and return true if the key is found, otherwise return false.
+ */
+static inline bool
+RT_NODE_DELETE_LEAF(RT_PTR_LOCAL node, uint64 key)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_delete_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+#endif
+
+/*
+ * Insert "child" into "node".
+ *
+ * "parent" is the parent of "node", so the grandparent of the child.
+ * If the node we're inserting into needs to grow, we update the parent's
+ * child pointer with the pointer to the new larger node.
+ */
+static void
+RT_NODE_INSERT_INNER(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_PTR_ALLOC child)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/* Like RT_NODE_INSERT_INNER, but for leaf nodes */
+static bool
+RT_NODE_INSERT_LEAF(RT_RADIX_TREE *tree, RT_PTR_LOCAL parent, RT_PTR_ALLOC stored_node, RT_PTR_LOCAL node,
+ uint64 key, RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_insert_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Create the radix tree in the given memory context and return it.
+ */
+RT_SCOPE RT_RADIX_TREE *
+#ifdef RT_SHMEM
+RT_CREATE(MemoryContext ctx, dsa_area *dsa, int tranche_id)
+#else
+RT_CREATE(MemoryContext ctx)
+#endif
+{
+ RT_RADIX_TREE *tree;
+ MemoryContext old_ctx;
+#ifdef RT_SHMEM
+ dsa_pointer dp;
+#endif
+
+ old_ctx = MemoryContextSwitchTo(ctx);
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+ tree->context = ctx;
+
+#ifdef RT_SHMEM
+ tree->dsa = dsa;
+ dp = dsa_allocate0(dsa, sizeof(RT_RADIX_TREE_CONTROL));
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, dp);
+ tree->ctl->handle = dp;
+ tree->ctl->magic = RT_RADIX_TREE_MAGIC;
+ LWLockInitialize(&tree->ctl->lock, tranche_id);
+#else
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) palloc0(sizeof(RT_RADIX_TREE_CONTROL));
+
+ /* Create a slab context for each size class */
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
+ size_t inner_blocksize = RT_SLAB_BLOCK_SIZE(size_class.inner_size);
+ size_t leaf_blocksize = RT_SLAB_BLOCK_SIZE(size_class.leaf_size);
+
+ tree->inner_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ inner_blocksize,
+ size_class.inner_size);
+ tree->leaf_slabs[i] = SlabContextCreate(ctx,
+ size_class.name,
+ leaf_blocksize,
+ size_class.leaf_size);
+ }
+#endif
+
+ tree->ctl->root = RT_INVALID_PTR_ALLOC;
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return tree;
+}
+
+#ifdef RT_SHMEM
+RT_SCOPE RT_RADIX_TREE *
+RT_ATTACH(dsa_area *dsa, RT_HANDLE handle)
+{
+ RT_RADIX_TREE *tree;
+ dsa_pointer control;
+
+ tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE));
+
+	/* Find the control object in shared memory */
+ control = handle;
+
+ tree->dsa = dsa;
+ tree->ctl = (RT_RADIX_TREE_CONTROL *) dsa_get_address(dsa, control);
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ return tree;
+}
+
+RT_SCOPE void
+RT_DETACH(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ pfree(tree);
+}
+
+RT_SCOPE RT_HANDLE
+RT_GET_HANDLE(RT_RADIX_TREE *tree)
+{
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ return tree->ctl->handle;
+}
+
+/*
+ * Recursively free all nodes allocated in the DSA area.
+ */
+static void
+RT_FREE_RECURSE(RT_RADIX_TREE *tree, RT_PTR_ALLOC ptr)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, ptr);
+
+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();
+
+ /* The leaf node doesn't have child pointers */
+ if (RT_NODE_IS_LEAF(node))
+ {
+ dsa_free(tree->dsa, ptr);
+ return;
+ }
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ for (int i = 0; i < n3->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n3->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ for (int i = 0; i < n32->base.n.count; i++)
+ RT_FREE_RECURSE(tree, n32->children[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+ }
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ RT_FREE_RECURSE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+ }
+
+ break;
+ }
+ }
+
+ /* Free the inner node */
+ dsa_free(tree->dsa, ptr);
+}
+#endif
+
+/*
+ * Free the given radix tree.
+ */
+RT_SCOPE void
+RT_FREE(RT_RADIX_TREE *tree)
+{
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+
+ /* Free all memory used for radix tree nodes */
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_FREE_RECURSE(tree, tree->ctl->root);
+
+ /*
+	 * Vandalize the control block to help catch programming errors where
+ * other backends access the memory formerly occupied by this radix tree.
+ */
+ tree->ctl->magic = 0;
+ dsa_free(tree->dsa, tree->ctl->handle);
+#else
+ pfree(tree->ctl);
+
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ MemoryContextDelete(tree->inner_slabs[i]);
+ MemoryContextDelete(tree->leaf_slabs[i]);
+ }
+#endif
+
+ pfree(tree);
+}
+
+/*
+ * Set the value for the given key. If the entry already exists, we update its
+ * value and return true. Returns false if the entry doesn't yet exist.
+ */
+RT_SCOPE bool
+RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ int shift;
+ bool updated;
+ RT_PTR_LOCAL parent;
+ RT_PTR_ALLOC stored_child;
+ RT_PTR_LOCAL child;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ /* Empty tree, create the root */
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ RT_NEW_ROOT(tree, key);
+
+ /* Extend the tree if necessary */
+ if (key > tree->ctl->max_val)
+ RT_EXTEND(tree, key);
+
+ stored_child = tree->ctl->root;
+ parent = RT_PTR_GET_LOCAL(tree, stored_child);
+ shift = parent->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC new_child = RT_INVALID_PTR_ALLOC;
+
+ child = RT_PTR_GET_LOCAL(tree, stored_child);
+
+ if (RT_NODE_IS_LEAF(child))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(child, key, &new_child))
+ {
+ RT_SET_EXTEND(tree, key, value_p, parent, stored_child, child);
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ parent = child;
+ stored_child = new_child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ updated = RT_NODE_INSERT_LEAF(tree, parent, stored_child, child, key, value_p);
+
+ /* Update the statistics */
+ if (!updated)
+ tree->ctl->num_keys++;
+
+ RT_UNLOCK(tree);
+ return updated;
+}
+
+/*
+ * Search for the given key in the radix tree. Return true if the key is found,
+ * otherwise return false. On success, the value is copied into *value_p, so it must
+ * not be NULL.
+ */
+RT_SCOPE bool
+RT_SEARCH(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p)
+{
+ RT_PTR_LOCAL node;
+ int shift;
+ bool found;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+ Assert(value_p != NULL);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+ shift = node->shift;
+
+ /* Descend the tree until we reach a leaf node */
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ if (RT_NODE_IS_LEAF(node))
+ break;
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ node = RT_PTR_GET_LOCAL(tree, child);
+ shift -= RT_NODE_SPAN;
+ }
+
+ found = RT_NODE_SEARCH_LEAF(node, key, value_p);
+
+ RT_UNLOCK(tree);
+ return found;
+}
+
+#ifdef RT_USE_DELETE
+/*
+ * Delete the given key from the radix tree. Return true if the key is found (and
+ * deleted), otherwise do nothing and return false.
+ */
+RT_SCOPE bool
+RT_DELETE(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_LOCAL node;
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_ALLOC stack[RT_MAX_LEVEL] = {0};
+ int shift;
+ int level;
+ bool deleted;
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ RT_LOCK_EXCLUSIVE(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root) || key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /*
+ * Descend the tree to search the key while building a stack of nodes we
+ * visited.
+ */
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ level = -1;
+ while (shift > 0)
+ {
+ RT_PTR_ALLOC child = RT_INVALID_PTR_ALLOC;
+
+ /* Push the current node to the stack */
+ stack[++level] = allocnode;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ {
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ allocnode = child;
+ shift -= RT_NODE_SPAN;
+ }
+
+ /* Delete the key from the leaf node if it exists */
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_LEAF(node, key);
+
+ if (!deleted)
+ {
+ /* the key was not found in the leaf node */
+ RT_UNLOCK(tree);
+ return false;
+ }
+
+ /* Found the key to delete. Update the statistics */
+ tree->ctl->num_keys--;
+
+ /*
+ * If the leaf node still has keys, we don't need to delete the node, so
+ * we're done.
+ */
+ if (node->count > 0)
+ {
+ RT_UNLOCK(tree);
+ return true;
+ }
+
+ /* Free the empty leaf node */
+ RT_FREE_NODE(tree, allocnode);
+
+ /* Delete the key from the inner nodes, walking back up the stack */
+ while (level >= 0)
+ {
+ allocnode = stack[level--];
+
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ deleted = RT_NODE_DELETE_INNER(node, key);
+ Assert(deleted);
+
+ /* If the node didn't become empty, we stop deleting the key */
+ if (node->count > 0)
+ break;
+
+ /* The node became empty */
+ RT_FREE_NODE(tree, allocnode);
+ }
+
+ RT_UNLOCK(tree);
+ return true;
+}
+#endif
+
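+/*
+ * Splice the given chunk into the iterator's key at the given shift position,
+ * replacing whatever chunk was stored there before.
+ */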
+static inline void
+RT_ITER_UPDATE_KEY(RT_ITER *iter, uint8 chunk, uint8 shift)
+{
+ iter->key &= ~(((uint64) RT_CHUNK_MASK) << shift);
+ iter->key |= (((uint64) chunk) << shift);
+}
+
+/*
+ * Advance to the next slot in the inner node. Return the child if one exists,
+ * otherwise NULL.
+ */
+static inline RT_PTR_LOCAL
+RT_NODE_INNER_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter)
+{
+#define RT_NODE_LEVEL_INNER
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_INNER
+}
+
+/*
+ * Advance to the next slot in the leaf node. On success, return true and
+ * store the value in *value_p; otherwise return false.
+ */
+static inline bool
+RT_NODE_LEAF_ITERATE_NEXT(RT_ITER *iter, RT_NODE_ITER *node_iter,
+ RT_VALUE_TYPE *value_p)
+{
+#define RT_NODE_LEVEL_LEAF
+#include "lib/radixtree_iter_impl.h"
+#undef RT_NODE_LEVEL_LEAF
+}
+
+/*
+ * Update each node_iter for inner nodes in the iterator node stack.
+ */
+static void
+RT_UPDATE_ITER_STACK(RT_ITER *iter, RT_PTR_LOCAL from_node, int from)
+{
+ int level = from;
+ RT_PTR_LOCAL node = from_node;
+
+ for (;;)
+ {
+ RT_NODE_ITER *node_iter = &(iter->stack[level--]);
+
+ node_iter->node = node;
+ node_iter->current_idx = -1;
+
+ /* We don't advance the leaf node iterator here */
+ if (RT_NODE_IS_LEAF(node))
+ return;
+
+ /* Advance to the next slot in the inner node */
+ node = RT_NODE_INNER_ITERATE_NEXT(iter, node_iter);
+
+ /* We must find the first child in the node */
+ Assert(node);
+ }
+}
+
+/*
+ * Create and return the iterator for the given radix tree.
+ *
+ * The radix tree is locked in shared mode during the iteration, so
+ * RT_END_ITERATE needs to be called when finished to release the lock.
+ */
+RT_SCOPE RT_ITER *
+RT_BEGIN_ITERATE(RT_RADIX_TREE *tree)
+{
+ MemoryContext old_ctx;
+ RT_ITER *iter;
+ RT_PTR_LOCAL root;
+ int top_level;
+
+ old_ctx = MemoryContextSwitchTo(tree->context);
+
+ iter = (RT_ITER *) palloc0(sizeof(RT_ITER));
+ iter->tree = tree;
+
+ RT_LOCK_SHARED(tree);
+
+ /* empty tree */
+ if (!iter->tree->ctl->root)
+ return iter;
+
+ root = RT_PTR_GET_LOCAL(tree, iter->tree->ctl->root);
+ top_level = root->shift / RT_NODE_SPAN;
+ iter->stack_len = top_level;
+
+ /*
+ * Descend from the root to the left-most leaf node. The key is constructed
+ * while descending to the leaf.
+ */
+ RT_UPDATE_ITER_STACK(iter, root, top_level);
+
+ MemoryContextSwitchTo(old_ctx);
+
+ return iter;
+}
+
+/*
+ * Return true and set *key_p and *value_p if there is a next key; otherwise
+ * return false.
+ */
+RT_SCOPE bool
+RT_ITERATE_NEXT(RT_ITER *iter, uint64 *key_p, RT_VALUE_TYPE *value_p)
+{
+ /* Empty tree */
+ if (!iter->tree->ctl->root)
+ return false;
+
+ for (;;)
+ {
+ RT_PTR_LOCAL child = NULL;
+ RT_VALUE_TYPE value;
+ int level;
+ bool found;
+
+ /* Advance the leaf node iterator to get next key-value pair */
+ found = RT_NODE_LEAF_ITERATE_NEXT(iter, &(iter->stack[0]), &value);
+
+ if (found)
+ {
+ *key_p = iter->key;
+ *value_p = value;
+ return true;
+ }
+
+ /*
+ * We've visited all values in the leaf node, so advance the inner node
+ * iterators from level 1 upward until we find the next child node.
+ */
+ for (level = 1; level <= iter->stack_len; level++)
+ {
+ child = RT_NODE_INNER_ITERATE_NEXT(iter, &(iter->stack[level]));
+
+ if (child)
+ break;
+ }
+
+ /* the iteration finished */
+ if (!child)
+ return false;
+
+ /*
+ * Set the node to the node iterator and update the iterator stack
+ * from this node.
+ */
+ RT_UPDATE_ITER_STACK(iter, child, level - 1);
+
+ /* Node iterators are updated, so try again from the leaf */
+ }
+
+ return false;
+}
+
+/*
+ * Terminate the iteration and release the lock.
+ *
+ * This function must be called when the iteration is finished, or when
+ * bailing out of it early.
+ */
+RT_SCOPE void
+RT_END_ITERATE(RT_ITER *iter)
+{
+#ifdef RT_SHMEM
+ Assert(LWLockHeldByMe(&iter->tree->ctl->lock));
+#endif
+
+ RT_UNLOCK(iter->tree);
+ pfree(iter);
+}
+
+/*
+ * Return the amount of memory used by the radix tree.
+ */
+RT_SCOPE uint64
+RT_MEMORY_USAGE(RT_RADIX_TREE *tree)
+{
+ Size total = 0;
+
+ RT_LOCK_SHARED(tree);
+
+#ifdef RT_SHMEM
+ Assert(tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+ total = dsa_get_total_size(tree->dsa);
+#else
+ for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
+ {
+ total += MemoryContextMemAllocated(tree->inner_slabs[i], true);
+ total += MemoryContextMemAllocated(tree->leaf_slabs[i], true);
+ }
+#endif
+
+ RT_UNLOCK(tree);
+ return total;
+}
+
+/*
+ * Verify the radix tree node.
+ */
+static void
+RT_VERIFY_NODE(RT_PTR_LOCAL node)
+{
+#ifdef USE_ASSERT_CHECKING
+ Assert(node->count >= 0);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE_BASE_3 *n3 = (RT_NODE_BASE_3 *) node;
+
+ for (int i = 1; i < n3->n.count; i++)
+ Assert(n3->chunks[i - 1] < n3->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE_BASE_32 *n32 = (RT_NODE_BASE_32 *) node;
+
+ for (int i = 1; i < n32->n.count; i++)
+ Assert(n32->chunks[i - 1] < n32->chunks[i]);
+
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *n125 = (RT_NODE_BASE_125 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ uint8 slot = n125->slot_idxs[i];
+ int idx = RT_BM_IDX(slot);
+ int bitnum = RT_BM_BIT(slot);
+
+ if (!RT_NODE_125_IS_CHUNK_USED(n125, i))
+ continue;
+
+ /* Check if the corresponding slot is used */
+ Assert(slot < node->fanout);
+ Assert((n125->isset[idx] & ((bitmapword) 1 << bitnum)) != 0);
+
+ cnt++;
+ }
+
+ Assert(n125->n.count == cnt);
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+ int cnt = 0;
+
+ for (int i = 0; i < RT_BM_IDX(RT_NODE_MAX_SLOTS); i++)
+ cnt += bmw_popcount(n256->isset[i]);
+
+ /* Check that the number of used chunks matches the count */
+ Assert(n256->base.n.count == cnt);
+
+ break;
+ }
+ }
+ }
+#endif
+}
+
+/***************** DEBUG FUNCTIONS *****************/
+#ifdef RT_DEBUG
+
+#define RT_UINT64_FORMAT_HEX "%" INT64_MODIFIER "X"
+
+RT_SCOPE void
+RT_STATS(RT_RADIX_TREE *tree)
+{
+ RT_LOCK_SHARED(tree);
+
+ fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val);
+ fprintf(stderr, "num_keys = " UINT64_FORMAT "\n", tree->ctl->num_keys);
+
+#ifdef RT_SHMEM
+ fprintf(stderr, "handle = " UINT64_FORMAT "\n", tree->ctl->handle);
+#endif
+
+ if (RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_PTR_LOCAL root = RT_PTR_GET_LOCAL(tree, tree->ctl->root);
+
+ fprintf(stderr, "height = %d, n3 = %u, n15 = %u, n32 = %u, n125 = %u, n256 = %u\n",
+ root->shift / RT_NODE_SPAN,
+ tree->ctl->cnt[RT_CLASS_3],
+ tree->ctl->cnt[RT_CLASS_32_MIN],
+ tree->ctl->cnt[RT_CLASS_32_MAX],
+ tree->ctl->cnt[RT_CLASS_125],
+ tree->ctl->cnt[RT_CLASS_256]);
+ }
+
+ RT_UNLOCK(tree);
+}
+
+static void
+RT_DUMP_NODE(RT_RADIX_TREE *tree, RT_PTR_ALLOC allocnode, int level,
+ bool recurse, StringInfo buf)
+{
+ RT_PTR_LOCAL node = RT_PTR_GET_LOCAL(tree, allocnode);
+ StringInfoData spaces;
+
+ initStringInfo(&spaces);
+ appendStringInfoSpaces(&spaces, (level * 4) + 1);
+
+ appendStringInfo(buf, "%s%s[%s] kind %d, fanout %d, count %u, shift %u:\n",
+ spaces.data,
+ level == 0 ? "" : "-> ",
+ RT_NODE_IS_LEAF(node) ? "LEAF" : "INNR",
+ (node->kind == RT_NODE_KIND_3) ? 3 :
+ (node->kind == RT_NODE_KIND_32) ? 32 :
+ (node->kind == RT_NODE_KIND_125) ? 125 : 256,
+ node->fanout == 0 ? 256 : node->fanout,
+ node->count, node->shift);
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_3 *n3 = (RT_NODE_LEAF_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n3->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_3 *n3 = (RT_NODE_INNER_3 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n3->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n3->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ for (int i = 0; i < node->count; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_32 *n32 = (RT_NODE_LEAF_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X\n",
+ spaces.data, i, n32->base.chunks[i]);
+ }
+ else
+ {
+ RT_NODE_INNER_32 *n32 = (RT_NODE_INNER_32 *) node;
+
+ appendStringInfo(buf, "%schunk[%d] 0x%X",
+ spaces.data, i, n32->base.chunks[i]);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, n32->children[i], level + 1,
+ recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE_BASE_125 *b125 = (RT_NODE_BASE_125 *) node;
+ char *sep = "";
+
+ appendStringInfo(buf, "%sslot_idxs: ", spaces.data);
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ appendStringInfo(buf, "%s[%d]=%d ",
+ sep, i, b125->slot_idxs[i]);
+ sep = ",";
+ }
+
+ appendStringInfo(buf, "\n%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) b125->isset)[i]);
+ appendStringInfo(buf, "\n");
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(b125, i))
+ continue;
+
+ if (RT_NODE_IS_LEAF(node))
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ else
+ {
+ RT_NODE_INNER_125 *n125 = (RT_NODE_INNER_125 *) b125;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_125_GET_CHILD(n125, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ appendStringInfo(buf, "%sisset-bitmap: ", spaces.data);
+ for (int i = 0; i < (RT_SLOT_IDX_LIMIT / BITS_PER_BYTE); i++)
+ appendStringInfo(buf, "%X ", ((uint8 *) n256->isset)[i]);
+ appendStringInfo(buf, "\n");
+ }
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_NODE_LEAF_256 *n256 = (RT_NODE_LEAF_256 *) node;
+
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X\n",
+ spaces.data, i);
+ }
+ else
+ {
+ RT_NODE_INNER_256 *n256 = (RT_NODE_INNER_256 *) node;
+
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+ continue;
+
+ appendStringInfo(buf, "%schunk 0x%X",
+ spaces.data, i);
+
+ if (recurse)
+ {
+ appendStringInfo(buf, "\n");
+ RT_DUMP_NODE(tree, RT_NODE_INNER_256_GET_CHILD(n256, i),
+ level + 1, recurse, buf);
+ }
+ else
+ appendStringInfo(buf, " (skipped)\n");
+ }
+ }
+ break;
+ }
+ }
+}
+
+RT_SCOPE void
+RT_DUMP_SEARCH(RT_RADIX_TREE *tree, uint64 key)
+{
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL node;
+ StringInfoData buf;
+ int shift;
+ int level = 0;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ if (key > tree->ctl->max_val)
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "key " UINT64_FORMAT "(0x" RT_UINT64_FORMAT_HEX ") is larger than max val\n",
+ key, key);
+ return;
+ }
+
+ initStringInfo(&buf);
+ allocnode = tree->ctl->root;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift = node->shift;
+ while (shift >= 0)
+ {
+ RT_PTR_ALLOC child;
+
+ RT_DUMP_NODE(tree, allocnode, level, false, &buf);
+
+ if (RT_NODE_IS_LEAF(node))
+ {
+ RT_VALUE_TYPE dummy;
+
+ /* We reached a leaf node; find the corresponding slot */
+ RT_NODE_SEARCH_LEAF(node, key, &dummy);
+
+ break;
+ }
+
+ if (!RT_NODE_SEARCH_INNER(node, key, &child))
+ break;
+
+ allocnode = child;
+ node = RT_PTR_GET_LOCAL(tree, allocnode);
+ shift -= RT_NODE_SPAN;
+ level++;
+ }
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s", buf.data);
+}
+
+RT_SCOPE void
+RT_DUMP(RT_RADIX_TREE *tree)
+{
+ StringInfoData buf;
+
+ RT_STATS(tree);
+
+ RT_LOCK_SHARED(tree);
+
+ if (!RT_PTR_ALLOC_IS_VALID(tree->ctl->root))
+ {
+ RT_UNLOCK(tree);
+ fprintf(stderr, "empty tree\n");
+ return;
+ }
+
+ initStringInfo(&buf);
+
+ RT_DUMP_NODE(tree, tree->ctl->root, 0, true, &buf);
+ RT_UNLOCK(tree);
+
+ fprintf(stderr, "%s",buf.data);
+}
+#endif
+
+#endif /* RT_DEFINE */
+
+
+/* undefine external parameters, so next radix tree can be defined */
+#undef RT_PREFIX
+#undef RT_SCOPE
+#undef RT_DECLARE
+#undef RT_DEFINE
+#undef RT_VALUE_TYPE
+
+/* locally declared macros */
+#undef RT_MAKE_PREFIX
+#undef RT_MAKE_NAME
+#undef RT_MAKE_NAME_
+#undef RT_NODE_SPAN
+#undef RT_NODE_MAX_SLOTS
+#undef RT_CHUNK_MASK
+#undef RT_MAX_SHIFT
+#undef RT_MAX_LEVEL
+#undef RT_GET_KEY_CHUNK
+#undef RT_BM_IDX
+#undef RT_BM_BIT
+#undef RT_LOCK_EXCLUSIVE
+#undef RT_LOCK_SHARED
+#undef RT_UNLOCK
+#undef RT_NODE_IS_LEAF
+#undef RT_NODE_MUST_GROW
+#undef RT_NODE_KIND_COUNT
+#undef RT_SIZE_CLASS_COUNT
+#undef RT_SLOT_IDX_LIMIT
+#undef RT_INVALID_SLOT_IDX
+#undef RT_SLAB_BLOCK_SIZE
+#undef RT_RADIX_TREE_MAGIC
+#undef RT_UINT64_FORMAT_HEX
+
+/* type declarations */
+#undef RT_RADIX_TREE
+#undef RT_RADIX_TREE_CONTROL
+#undef RT_PTR_LOCAL
+#undef RT_PTR_ALLOC
+#undef RT_INVALID_PTR_ALLOC
+#undef RT_HANDLE
+#undef RT_ITER
+#undef RT_NODE
+#undef RT_NODE_ITER
+#undef RT_NODE_KIND_3
+#undef RT_NODE_KIND_32
+#undef RT_NODE_KIND_125
+#undef RT_NODE_KIND_256
+#undef RT_NODE_BASE_3
+#undef RT_NODE_BASE_32
+#undef RT_NODE_BASE_125
+#undef RT_NODE_BASE_256
+#undef RT_NODE_INNER_3
+#undef RT_NODE_INNER_32
+#undef RT_NODE_INNER_125
+#undef RT_NODE_INNER_256
+#undef RT_NODE_LEAF_3
+#undef RT_NODE_LEAF_32
+#undef RT_NODE_LEAF_125
+#undef RT_NODE_LEAF_256
+#undef RT_SIZE_CLASS
+#undef RT_SIZE_CLASS_ELEM
+#undef RT_SIZE_CLASS_INFO
+#undef RT_CLASS_3
+#undef RT_CLASS_32_MIN
+#undef RT_CLASS_32_MAX
+#undef RT_CLASS_125
+#undef RT_CLASS_256
+
+/* function declarations */
+#undef RT_CREATE
+#undef RT_FREE
+#undef RT_ATTACH
+#undef RT_DETACH
+#undef RT_GET_HANDLE
+#undef RT_SEARCH
+#undef RT_SET
+#undef RT_BEGIN_ITERATE
+#undef RT_ITERATE_NEXT
+#undef RT_END_ITERATE
+#undef RT_USE_DELETE
+#undef RT_DELETE
+#undef RT_MEMORY_USAGE
+#undef RT_DUMP
+#undef RT_DUMP_NODE
+#undef RT_DUMP_SEARCH
+#undef RT_STATS
+
+/* internal helper functions */
+#undef RT_NEW_ROOT
+#undef RT_ALLOC_NODE
+#undef RT_INIT_NODE
+#undef RT_FREE_NODE
+#undef RT_FREE_RECURSE
+#undef RT_EXTEND
+#undef RT_SET_EXTEND
+#undef RT_SWITCH_NODE_KIND
+#undef RT_COPY_NODE
+#undef RT_REPLACE_NODE
+#undef RT_PTR_GET_LOCAL
+#undef RT_PTR_ALLOC_IS_VALID
+#undef RT_NODE_3_SEARCH_EQ
+#undef RT_NODE_32_SEARCH_EQ
+#undef RT_NODE_3_GET_INSERTPOS
+#undef RT_NODE_32_GET_INSERTPOS
+#undef RT_CHUNK_CHILDREN_ARRAY_SHIFT
+#undef RT_CHUNK_VALUES_ARRAY_SHIFT
+#undef RT_CHUNK_CHILDREN_ARRAY_DELETE
+#undef RT_CHUNK_VALUES_ARRAY_DELETE
+#undef RT_CHUNK_CHILDREN_ARRAY_COPY
+#undef RT_CHUNK_VALUES_ARRAY_COPY
+#undef RT_NODE_125_IS_CHUNK_USED
+#undef RT_NODE_INNER_125_GET_CHILD
+#undef RT_NODE_LEAF_125_GET_VALUE
+#undef RT_NODE_INNER_256_IS_CHUNK_USED
+#undef RT_NODE_LEAF_256_IS_CHUNK_USED
+#undef RT_NODE_INNER_256_GET_CHILD
+#undef RT_NODE_LEAF_256_GET_VALUE
+#undef RT_NODE_INNER_256_SET
+#undef RT_NODE_LEAF_256_SET
+#undef RT_NODE_INNER_256_DELETE
+#undef RT_NODE_LEAF_256_DELETE
+#undef RT_KEY_GET_SHIFT
+#undef RT_SHIFT_GET_MAX_VAL
+#undef RT_NODE_SEARCH_INNER
+#undef RT_NODE_SEARCH_LEAF
+#undef RT_NODE_UPDATE_INNER
+#undef RT_NODE_DELETE_INNER
+#undef RT_NODE_DELETE_LEAF
+#undef RT_NODE_INSERT_INNER
+#undef RT_NODE_INSERT_LEAF
+#undef RT_NODE_INNER_ITERATE_NEXT
+#undef RT_NODE_LEAF_ITERATE_NEXT
+#undef RT_UPDATE_ITER_STACK
+#undef RT_ITER_UPDATE_KEY
+#undef RT_VERIFY_NODE
+
+#undef RT_DEBUG
diff --git a/src/include/lib/radixtree_delete_impl.h b/src/include/lib/radixtree_delete_impl.h
new file mode 100644
index 0000000000..5f6dda1f12
--- /dev/null
+++ b/src/include/lib/radixtree_delete_impl.h
@@ -0,0 +1,122 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_delete_impl.h
+ * Common implementation for deletion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
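+ * This file is included to form the body of a node-level delete function;
+ * the includer must define either RT_NODE_LEVEL_INNER or RT_NODE_LEVEL_LEAF
+ * to select which node variants are operated on.
+ *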
+ * TODO: Shrink nodes when deletion would allow them to fit in a smaller
+ * size class.
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_delete_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n3->base.chunks, n3->values,
+ n3->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n3->base.chunks, n3->children,
+ n3->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_DELETE(n32->base.chunks, n32->values,
+ n32->base.n.count, idx);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_DELETE(n32->base.chunks, n32->children,
+ n32->base.n.count, idx);
+#endif
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+ int idx;
+ int bitnum;
+
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+ idx = RT_BM_IDX(slotpos);
+ bitnum = RT_BM_BIT(slotpos);
+ n125->base.isset[idx] &= ~((bitmapword) 1 << bitnum);
+ n125->base.slot_idxs[chunk] = RT_INVALID_SLOT_IDX;
+
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_DELETE(n256, chunk);
+#else
+ RT_NODE_INNER_256_DELETE(n256, chunk);
+#endif
+ break;
+ }
+ }
+
+ /* update statistics */
+ node->count--;
+
+ return true;
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_insert_impl.h b/src/include/lib/radixtree_insert_impl.h
new file mode 100644
index 0000000000..d56e58dcac
--- /dev/null
+++ b/src/include/lib/radixtree_insert_impl.h
@@ -0,0 +1,328 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_insert_impl.h
+ * Common implementation for insertion in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_insert_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ const bool is_leaf = true;
+ bool chunk_exists = false;
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+ const bool is_leaf = false;
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
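+ /*
+ * Insert into the node according to its current kind. If the node has
+ * room, insert here and break out of the switch; otherwise grow the node
+ * into the next larger kind (or size class) and fall through to the
+ * following case to insert into the new node.
+ */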
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_3_SEARCH_EQ(&n3->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n3->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n3)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE32_TYPE *new32;
+ const uint8 new_kind = RT_NODE_KIND_32;
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MIN;
+
+ /* grow node from 3 to 32 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_COPY(n3->base.chunks, n3->values,
+ new32->base.chunks, new32->values);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_COPY(n3->base.chunks, n3->children,
+ new32->base.chunks, new32->children);
+#endif
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_3_GET_INSERTPOS(&n3->base, chunk);
+ int count = n3->base.n.count;
+
+ /* shift chunks and children */
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n3->base.chunks, n3->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n3->base.chunks, n3->children,
+ count, insertpos);
+#endif
+ }
+
+ n3->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n3->values[insertpos] = *value_p;
+#else
+ n3->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_32:
+ {
+ const RT_SIZE_CLASS_ELEM class32_max = RT_SIZE_CLASS_INFO[RT_CLASS_32_MAX];
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ int idx = RT_NODE_32_SEARCH_EQ(&n32->base, chunk);
+
+ if (idx != -1)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n32->values[idx] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n32)) &&
+ n32->base.n.fanout < class32_max.fanout)
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ const RT_SIZE_CLASS_ELEM class32_min = RT_SIZE_CLASS_INFO[RT_CLASS_32_MIN];
+ const RT_SIZE_CLASS new_class = RT_CLASS_32_MAX;
+
+ Assert(n32->base.n.fanout == class32_min.fanout);
+
+ /* grow to the next size class of this kind */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_PTR_GET_LOCAL(tree, allocnode);
+ n32 = (RT_NODE32_TYPE *) newnode;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ memcpy(newnode, node, class32_min.leaf_size);
+#else
+ memcpy(newnode, node, class32_min.inner_size);
+#endif
+ newnode->fanout = class32_max.fanout;
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+
+ if (unlikely(RT_NODE_MUST_GROW(n32)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE125_TYPE *new125;
+ const uint8 new_kind = RT_NODE_KIND_125;
+ const RT_SIZE_CLASS new_class = RT_CLASS_125;
+
+ Assert(n32->base.n.fanout == class32_max.fanout);
+
+ /* grow node from 32 to 125 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new125 = (RT_NODE125_TYPE *) newnode;
+
+ for (int i = 0; i < class32_max.fanout; i++)
+ {
+ new125->base.slot_idxs[n32->base.chunks[i]] = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ new125->values[i] = n32->values[i];
+#else
+ new125->children[i] = n32->children[i];
+#endif
+ }
+
+ /*
+ * Since we just copied a dense array, we can set the bits
+ * using a single store, provided the length of that array
+ * is at most the number of bits in a bitmapword.
+ */
+ Assert(class32_max.fanout <= sizeof(bitmapword) * BITS_PER_BYTE);
+ new125->base.isset[0] = (bitmapword) (((uint64) 1 << class32_max.fanout) - 1);
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int insertpos = RT_NODE_32_GET_INSERTPOS(&n32->base, chunk);
+ int count = n32->base.n.count;
+
+ if (insertpos < count)
+ {
+ Assert(count > 0);
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_CHUNK_VALUES_ARRAY_SHIFT(n32->base.chunks, n32->values,
+ count, insertpos);
+#else
+ RT_CHUNK_CHILDREN_ARRAY_SHIFT(n32->base.chunks, n32->children,
+ count, insertpos);
+#endif
+ }
+
+ n32->base.chunks[insertpos] = chunk;
+#ifdef RT_NODE_LEVEL_LEAF
+ n32->values[insertpos] = *value_p;
+#else
+ n32->children[insertpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos;
+ int cnt = 0;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ slotpos = n125->base.slot_idxs[chunk];
+ if (slotpos != RT_INVALID_SLOT_IDX)
+ {
+ /* found the existing chunk */
+ chunk_exists = true;
+ n125->values[slotpos] = *value_p;
+ break;
+ }
+#endif
+ if (unlikely(RT_NODE_MUST_GROW(n125)))
+ {
+ RT_PTR_ALLOC allocnode;
+ RT_PTR_LOCAL newnode;
+ RT_NODE256_TYPE *new256;
+ const uint8 new_kind = RT_NODE_KIND_256;
+ const RT_SIZE_CLASS new_class = RT_CLASS_256;
+
+ /* grow node from 125 to 256 */
+ allocnode = RT_ALLOC_NODE(tree, new_class, is_leaf);
+ newnode = RT_SWITCH_NODE_KIND(tree, allocnode, node, new_kind, new_class, is_leaf);
+ new256 = (RT_NODE256_TYPE *) newnode;
+
+ for (int i = 0; i < RT_NODE_MAX_SLOTS && cnt < n125->base.n.count; i++)
+ {
+ if (!RT_NODE_125_IS_CHUNK_USED(&n125->base, i))
+ continue;
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_NODE_LEAF_256_SET(new256, i, RT_NODE_LEAF_125_GET_VALUE(n125, i));
+#else
+ RT_NODE_INNER_256_SET(new256, i, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ cnt++;
+ }
+
+ RT_REPLACE_NODE(tree, parent, stored_node, node, allocnode, key);
+ node = newnode;
+ }
+ else
+ {
+ int idx;
+ bitmapword inverse;
+
+ /* get the first word with at least one bit not set */
+ for (idx = 0; idx < RT_BM_IDX(RT_SLOT_IDX_LIMIT); idx++)
+ {
+ if (n125->base.isset[idx] < ~((bitmapword) 0))
+ break;
+ }
+
+ /* To get the first unset bit in X, get the first set bit in ~X */
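+ /*
+ * For example, if isset[idx] is ...0111, inverse is ...1000 and
+ * bmw_rightmost_one_pos(inverse) returns 3, the index of the first
+ * free slot in this word.
+ */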
+ inverse = ~(n125->base.isset[idx]);
+ slotpos = idx * BITS_PER_BITMAPWORD;
+ slotpos += bmw_rightmost_one_pos(inverse);
+ Assert(slotpos < node->fanout);
+
+ /* mark the slot used */
+ n125->base.isset[idx] |= bmw_rightmost_one(inverse);
+ n125->base.slot_idxs[chunk] = slotpos;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ n125->values[slotpos] = *value_p;
+#else
+ n125->children[slotpos] = child;
+#endif
+ break;
+ }
+ }
+ /* FALLTHROUGH */
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ chunk_exists = RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk);
+ Assert(chunk_exists || node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_LEAF_256_SET(n256, chunk, *value_p);
+#else
+ Assert(node->count < RT_NODE_MAX_SLOTS);
+ RT_NODE_INNER_256_SET(n256, chunk, child);
+#endif
+ break;
+ }
+ }
+
+ /* Update statistics */
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!chunk_exists)
+ node->count++;
+#else
+ node->count++;
+#endif
+
+ /*
+ * Done. Finally, verify that the chunk and value were inserted or replaced
+ * properly in the node.
+ */
+ RT_VERIFY_NODE(node);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return chunk_exists;
+#else
+ return;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_iter_impl.h b/src/include/lib/radixtree_iter_impl.h
new file mode 100644
index 0000000000..98c78eb237
--- /dev/null
+++ b/src/include/lib/radixtree_iter_impl.h
@@ -0,0 +1,153 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_iter_impl.h
+ * Common implementation for iteration in leaf and inner nodes.
+ *
+ * Note: There is deliberately no #include guard here
+ *
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_iter_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ bool found = false;
+ uint8 key_chunk;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ RT_VALUE_TYPE value;
+
+ Assert(RT_NODE_IS_LEAF(node_iter->node));
+#else
+ RT_PTR_LOCAL child = NULL;
+
+ Assert(!RT_NODE_IS_LEAF(node_iter->node));
+#endif
+
+#ifdef RT_SHMEM
+ Assert(iter->tree->ctl->magic == RT_RADIX_TREE_MAGIC);
+#endif
+
+ switch (node_iter->node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n3->base.n.count)
+ break;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n3->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n3->children[node_iter->current_idx]);
+#endif
+ key_chunk = n3->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node_iter->node;
+
+ node_iter->current_idx++;
+ if (node_iter->current_idx >= n32->base.n.count)
+ break;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ value = n32->values[node_iter->current_idx];
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, n32->children[node_iter->current_idx]);
+#endif
+ key_chunk = n32->base.chunks[node_iter->current_idx];
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+ if (RT_NODE_125_IS_CHUNK_USED((RT_NODE_BASE_125 *) n125, i))
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_125_GET_VALUE(n125, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_125_GET_CHILD(n125, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node_iter->node;
+ int i;
+
+ for (i = node_iter->current_idx + 1; i < RT_NODE_MAX_SLOTS; i++)
+ {
+#ifdef RT_NODE_LEVEL_LEAF
+ if (RT_NODE_LEAF_256_IS_CHUNK_USED(n256, i))
+#else
+ if (RT_NODE_INNER_256_IS_CHUNK_USED(n256, i))
+#endif
+ break;
+ }
+
+ if (i >= RT_NODE_MAX_SLOTS)
+ break;
+
+ node_iter->current_idx = i;
+#ifdef RT_NODE_LEVEL_LEAF
+ value = RT_NODE_LEAF_256_GET_VALUE(n256, i);
+#else
+ child = RT_PTR_GET_LOCAL(iter->tree, RT_NODE_INNER_256_GET_CHILD(n256, i));
+#endif
+ key_chunk = i;
+ found = true;
+ break;
+ }
+ }
+
+ if (found)
+ {
+ RT_ITER_UPDATE_KEY(iter, key_chunk, node_iter->node->shift);
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = value;
+#endif
+ }
+
+#ifdef RT_NODE_LEVEL_LEAF
+ return found;
+#else
+ return child;
+#endif
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/lib/radixtree_search_impl.h b/src/include/lib/radixtree_search_impl.h
new file mode 100644
index 0000000000..a8925c75d0
--- /dev/null
+++ b/src/include/lib/radixtree_search_impl.h
@@ -0,0 +1,138 @@
+/*-------------------------------------------------------------------------
+ *
+ * radixtree_search_impl.h
+ * Common implementation for search in leaf and inner nodes, plus
+ * update for inner nodes only.
+ *
+ * Note: There is deliberately no #include guard here
+ *
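+ * When the includer also defines RT_ACTION_UPDATE, the generated code
+ * replaces an existing child pointer with a new one instead of returning
+ * it; this is used for inner nodes only.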
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * src/include/lib/radixtree_search_impl.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#if defined(RT_NODE_LEVEL_INNER)
+#define RT_NODE3_TYPE RT_NODE_INNER_3
+#define RT_NODE32_TYPE RT_NODE_INNER_32
+#define RT_NODE125_TYPE RT_NODE_INNER_125
+#define RT_NODE256_TYPE RT_NODE_INNER_256
+#elif defined(RT_NODE_LEVEL_LEAF)
+#define RT_NODE3_TYPE RT_NODE_LEAF_3
+#define RT_NODE32_TYPE RT_NODE_LEAF_32
+#define RT_NODE125_TYPE RT_NODE_LEAF_125
+#define RT_NODE256_TYPE RT_NODE_LEAF_256
+#else
+#error node level must be either inner or leaf
+#endif
+
+ uint8 chunk = RT_GET_KEY_CHUNK(key, node->shift);
+
+#ifdef RT_NODE_LEVEL_LEAF
+ Assert(value_p != NULL);
+ Assert(RT_NODE_IS_LEAF(node));
+#else
+#ifndef RT_ACTION_UPDATE
+ Assert(child_p != NULL);
+#endif
+ Assert(!RT_NODE_IS_LEAF(node));
+#endif
+
+ switch (node->kind)
+ {
+ case RT_NODE_KIND_3:
+ {
+ RT_NODE3_TYPE *n3 = (RT_NODE3_TYPE *) node;
+ int idx = RT_NODE_3_SEARCH_EQ((RT_NODE_BASE_3 *) n3, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n3->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n3->values[idx];
+#else
+ *child_p = n3->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_32:
+ {
+ RT_NODE32_TYPE *n32 = (RT_NODE32_TYPE *) node;
+ int idx = RT_NODE_32_SEARCH_EQ((RT_NODE_BASE_32 *) n32, chunk);
+
+#ifdef RT_ACTION_UPDATE
+ Assert(idx >= 0);
+ n32->children[idx] = new_child;
+#else
+ if (idx < 0)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = n32->values[idx];
+#else
+ *child_p = n32->children[idx];
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_125:
+ {
+ RT_NODE125_TYPE *n125 = (RT_NODE125_TYPE *) node;
+ int slotpos = n125->base.slot_idxs[chunk];
+
+#ifdef RT_ACTION_UPDATE
+ Assert(slotpos != RT_INVALID_SLOT_IDX);
+ n125->children[slotpos] = new_child;
+#else
+ if (slotpos == RT_INVALID_SLOT_IDX)
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_125_GET_VALUE(n125, chunk);
+#else
+ *child_p = RT_NODE_INNER_125_GET_CHILD(n125, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ case RT_NODE_KIND_256:
+ {
+ RT_NODE256_TYPE *n256 = (RT_NODE256_TYPE *) node;
+
+#ifdef RT_ACTION_UPDATE
+ RT_NODE_INNER_256_SET(n256, chunk, new_child);
+#else
+#ifdef RT_NODE_LEVEL_LEAF
+ if (!RT_NODE_LEAF_256_IS_CHUNK_USED(n256, chunk))
+#else
+ if (!RT_NODE_INNER_256_IS_CHUNK_USED(n256, chunk))
+#endif
+ return false;
+
+#ifdef RT_NODE_LEVEL_LEAF
+ *value_p = RT_NODE_LEAF_256_GET_VALUE(n256, chunk);
+#else
+ *child_p = RT_NODE_INNER_256_GET_CHILD(n256, chunk);
+#endif
+#endif /* RT_ACTION_UPDATE */
+ break;
+ }
+ }
+
+#ifdef RT_ACTION_UPDATE
+ return;
+#else
+ return true;
+#endif /* RT_ACTION_UPDATE */
+
+#undef RT_NODE3_TYPE
+#undef RT_NODE32_TYPE
+#undef RT_NODE125_TYPE
+#undef RT_NODE256_TYPE
diff --git a/src/include/utils/dsa.h b/src/include/utils/dsa.h
index 3ce4ee300a..2af215484f 100644
--- a/src/include/utils/dsa.h
+++ b/src/include/utils/dsa.h
@@ -121,6 +121,7 @@ extern dsa_handle dsa_get_handle(dsa_area *area);
extern dsa_pointer dsa_allocate_extended(dsa_area *area, size_t size, int flags);
extern void dsa_free(dsa_area *area, dsa_pointer dp);
extern void *dsa_get_address(dsa_area *area, dsa_pointer dp);
+extern size_t dsa_get_total_size(dsa_area *area);
extern void dsa_trim(dsa_area *area);
extern void dsa_dump(dsa_area *area);
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 79e3033ec2..89f42bf9e3 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -28,6 +28,7 @@ SUBDIRS = \
test_pg_db_role_setting \
test_pg_dump \
test_predtest \
+ test_radixtree \
test_rbtree \
test_regex \
test_rls_hooks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index dcb82ed68f..beaf4080fb 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -25,6 +25,7 @@ subdir('test_parser')
subdir('test_pg_db_role_setting')
subdir('test_pg_dump')
subdir('test_predtest')
+subdir('test_radixtree')
subdir('test_rbtree')
subdir('test_regex')
subdir('test_rls_hooks')
diff --git a/src/test/modules/test_radixtree/.gitignore b/src/test/modules/test_radixtree/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_radixtree/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_radixtree/Makefile b/src/test/modules/test_radixtree/Makefile
new file mode 100644
index 0000000000..da06b93da3
--- /dev/null
+++ b/src/test/modules/test_radixtree/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_radixtree/Makefile
+
+MODULE_big = test_radixtree
+OBJS = \
+ $(WIN32RES) \
+ test_radixtree.o
+PGFILEDESC = "test_radixtree - test code for src/include/lib/radixtree.h"
+
+EXTENSION = test_radixtree
+DATA = test_radixtree--1.0.sql
+
+REGRESS = test_radixtree
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_radixtree
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_radixtree/README b/src/test/modules/test_radixtree/README
new file mode 100644
index 0000000000..a8b271869a
--- /dev/null
+++ b/src/test/modules/test_radixtree/README
@@ -0,0 +1,7 @@
+test_radixtree contains unit tests for the radix tree implementation in
+src/include/lib/radixtree.h.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark. If you set the 'rt_test_stats' flag in
+test_radixtree.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_radixtree/expected/test_radixtree.out b/src/test/modules/test_radixtree/expected/test_radixtree.out
new file mode 100644
index 0000000000..ce645cb8b5
--- /dev/null
+++ b/src/test/modules/test_radixtree/expected/test_radixtree.out
@@ -0,0 +1,36 @@
+CREATE EXTENSION test_radixtree;
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
+NOTICE: testing basic operations with leaf node 4
+NOTICE: testing basic operations with inner node 4
+NOTICE: testing basic operations with leaf node 32
+NOTICE: testing basic operations with inner node 32
+NOTICE: testing basic operations with leaf node 125
+NOTICE: testing basic operations with inner node 125
+NOTICE: testing basic operations with leaf node 256
+NOTICE: testing basic operations with inner node 256
+NOTICE: testing radix tree node types with shift "0"
+NOTICE: testing radix tree node types with shift "8"
+NOTICE: testing radix tree node types with shift "16"
+NOTICE: testing radix tree node types with shift "24"
+NOTICE: testing radix tree node types with shift "32"
+NOTICE: testing radix tree node types with shift "40"
+NOTICE: testing radix tree node types with shift "48"
+NOTICE: testing radix tree node types with shift "56"
+NOTICE: testing radix tree with pattern "all ones"
+NOTICE: testing radix tree with pattern "alternating bits"
+NOTICE: testing radix tree with pattern "clusters of ten"
+NOTICE: testing radix tree with pattern "clusters of hundred"
+NOTICE: testing radix tree with pattern "one-every-64k"
+NOTICE: testing radix tree with pattern "sparse"
+NOTICE: testing radix tree with pattern "single values, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^32"
+NOTICE: testing radix tree with pattern "clusters, distance > 2^60"
+ test_radixtree
+----------------
+
+(1 row)
+
diff --git a/src/test/modules/test_radixtree/meson.build b/src/test/modules/test_radixtree/meson.build
new file mode 100644
index 0000000000..6add06bbdb
--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/
+
+test_radixtree_sources = files(
+ 'test_radixtree.c',
+)
+
+if host_system == 'windows'
+ test_radixtree_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_radixtree',
+ '--FILEDESC', 'test_radixtree - test code for src/include/lib/radixtree.h',])
+endif
+
+test_radixtree = shared_module('test_radixtree',
+ test_radixtree_sources,
+ link_with: pgport_srv,
+ kwargs: pg_mod_args,
+)
+testprep_targets += test_radixtree
+
+install_data(
+ 'test_radixtree.control',
+ 'test_radixtree--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'test_radixtree',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_radixtree',
+ ],
+ },
+}
diff --git a/src/test/modules/test_radixtree/sql/test_radixtree.sql b/src/test/modules/test_radixtree/sql/test_radixtree.sql
new file mode 100644
index 0000000000..41ece5e9f5
--- /dev/null
+++ b/src/test/modules/test_radixtree/sql/test_radixtree.sql
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_radixtree;
+
+--
+-- All the logic is in the test_radixtree() function. It will throw
+-- an error if something fails.
+--
+SELECT test_radixtree();
diff --git a/src/test/modules/test_radixtree/test_radixtree--1.0.sql b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
new file mode 100644
index 0000000000..074a5a7ea7
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_radixtree/test_radixtree--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_radixtree" to load this file. \quit
+
+CREATE FUNCTION test_radixtree()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_radixtree/test_radixtree.c b/src/test/modules/test_radixtree/test_radixtree.c
new file mode 100644
index 0000000000..afe53382f3
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.c
@@ -0,0 +1,681 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_radixtree.c
+ * Test radixtree set data structure.
+ *
+ * Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * src/test/modules/test_radixtree/test_radixtree.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "common/pg_prng.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "nodes/bitmapset.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "storage/lwlock.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+
+#define UINT64_HEX_FORMAT "%" INT64_MODIFIER "X"
+
+/*
+ * The tests pass with uint32, but build with warnings because the string
+ * format expects uint64.
+ */
+typedef uint64 TestValueType;
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed. That can be used as
+ * micro-benchmark of various operations and input patterns (you might
+ * want to increase the number of values used in each of the test, if
+ * you do that, to reduce noise).
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool rt_test_stats = false;
+
+static int rt_node_kind_fanouts[] = {
+ 0,
+ 4, /* RT_NODE_KIND_4 */
+ 32, /* RT_NODE_KIND_32 */
+ 125, /* RT_NODE_KIND_125 */
+ 256 /* RT_NODE_KIND_256 */
+};
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+ char *test_name; /* short name of the test, for humans */
+ char *pattern_str; /* a bit pattern */
+ uint64 spacing; /* pattern repeats at this interval */
+ uint64 num_values; /* number of integers to set in total */
+} test_spec;
+
+/* Test patterns borrowed from test_integerset.c */
+static const test_spec test_specs[] = {
+ {
+ "all ones", "1111111111",
+ 10, 1000000
+ },
+ {
+ "alternating bits", "0101010101",
+ 10, 1000000
+ },
+ {
+ "clusters of ten", "1111111111",
+ 10000, 1000000
+ },
+ {
+ "clusters of hundred",
+ "1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+ 10000, 1000000
+ },
+ {
+ "one-every-64k", "1",
+ 65536, 1000000
+ },
+ {
+ "sparse", "100000000000000000000000000000001",
+ 10000000, 1000000
+ },
+ {
+ "single values, distance > 2^32", "1",
+ UINT64CONST(10000000000), 100000
+ },
+ {
+ "clusters, distance > 2^32", "10101010",
+ UINT64CONST(10000000000), 1000000
+ },
+ {
+ "clusters, distance > 2^60", "10101010",
+ UINT64CONST(2000000000000000000),
+ 23 /* can't be much higher than this, or we
+ * overflow uint64 */
+ }
+};
+
+/* define the radix tree implementation to test */
+#define RT_PREFIX rt
+#define RT_SCOPE
+#define RT_DECLARE
+#define RT_DEFINE
+#define RT_USE_DELETE
+#define RT_VALUE_TYPE TestValueType
+/* #define RT_SHMEM */
+#include "lib/radixtree.h"
+
+
+/*
+ * Return the number of keys in the radix tree.
+ */
+static uint64
+rt_num_entries(rt_radix_tree *tree)
+{
+ return tree->ctl->num_keys;
+}
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_radixtree);
+
+static void
+test_empty(void)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ TestValueType dummy;
+ uint64 key;
+ TestValueType val;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ if (rt_search(radixtree, 0, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, 1, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_search(radixtree, PG_UINT64_MAX, &dummy))
+ elog(ERROR, "rt_search on empty tree returned true");
+
+ if (rt_delete(radixtree, 0))
+ elog(ERROR, "rt_delete on empty tree returned true");
+
+ if (rt_num_entries(radixtree) != 0)
+ elog(ERROR, "rt_num_entries on empty tree return non-zero");
+
+ iter = rt_begin_iterate(radixtree);
+
+ if (rt_iterate_next(iter, &key, &val))
+ elog(ERROR, "rt_itereate_next on empty tree returned true");
+
+ rt_end_iterate(iter);
+
+ rt_free(radixtree);
+
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+static void
+test_basic(int children, bool test_inner)
+{
+ rt_radix_tree *radixtree;
+ uint64 *keys;
+ int shift = test_inner ? 8 : 0;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing basic operations with %s node %d",
+ test_inner ? "inner" : "leaf", children);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /* prepare keys in an interleaved order like 1, 32, 2, 31, 3, 30, ... */
+ keys = palloc(sizeof(uint64) * children);
+ for (int i = 0; i < children; i++)
+ {
+ if (i % 2 == 0)
+ keys[i] = (uint64) ((i / 2) + 1) << shift;
+ else
+ keys[i] = (uint64) (children - (i / 2)) << shift;
+ }
+
+ /* insert keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ /* look up keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType value;
+
+ if (!rt_search(radixtree, keys[i], &value))
+ elog(ERROR, "could not find key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (value != (TestValueType) keys[i])
+ elog(ERROR, "rt_search returned 0x" UINT64_HEX_FORMAT ", expected " UINT64_HEX_FORMAT,
+ value, (TestValueType) keys[i]);
+ }
+
+ /* update keys */
+ for (int i = 0; i < children; i++)
+ {
+ TestValueType update = keys[i] + 1;
+ if (!rt_set(radixtree, keys[i], (TestValueType*) &update))
+ elog(ERROR, "could not update key 0x" UINT64_HEX_FORMAT, keys[i]);
+ }
+
+ /* repeat deleting and inserting keys */
+ for (int i = 0; i < children; i++)
+ {
+ if (!rt_delete(radixtree, keys[i]))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, keys[i]);
+ if (rt_set(radixtree, keys[i], (TestValueType*) &keys[i]))
+ elog(ERROR, "new inserted key 0x" UINT64_HEX_FORMAT " is found ", keys[i]);
+ }
+
+ pfree(keys);
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Check that the keys from start to end, shifted by 'shift', exist in the tree.
+ */
+static void
+check_search_on_node(rt_radix_tree *radixtree, uint8 shift, int start, int end,
+ int incr)
+{
+ for (int i = start; i < end; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ TestValueType val;
+
+ if (!rt_search(radixtree, key, &val))
+ elog(ERROR, "key 0x" UINT64_HEX_FORMAT " is not found on node-%d",
+ key, end);
+ if (val != (TestValueType) key)
+ elog(ERROR, "rt_search with key 0x" UINT64_HEX_FORMAT " returns 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ key, val, key);
+ }
+}
+
+static void
+test_node_types_insert(rt_radix_tree *radixtree, uint8 shift, bool insert_asc)
+{
+ uint64 num_entries;
+ int ninserted = 0;
+ int start = insert_asc ? 0 : 256;
+ int incr = insert_asc ? 1 : -1;
+ int end = insert_asc ? 256 : 0;
+ int node_kind_idx = 1;
+
+ for (int i = start; i != end; i += incr)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_set(radixtree, key, (TestValueType*) &key);
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " is found", key);
+
+ /*
+ * After filling all slots in each node type, check if the values
+ * are stored properly.
+ */
+ if (ninserted == rt_node_kind_fanouts[node_kind_idx] - 1)
+ {
+ int check_start = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx - 1]
+ : rt_node_kind_fanouts[node_kind_idx];
+ int check_end = insert_asc
+ ? rt_node_kind_fanouts[node_kind_idx]
+ : rt_node_kind_fanouts[node_kind_idx - 1];
+
+ check_search_on_node(radixtree, shift, check_start, check_end, incr);
+ node_kind_idx++;
+ }
+
+ ninserted++;
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ if (num_entries != 256)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+static void
+test_node_types_delete(rt_radix_tree *radixtree, uint8 shift)
+{
+ uint64 num_entries;
+
+ for (int i = 0; i < 256; i++)
+ {
+ uint64 key = ((uint64) i << shift);
+ bool found;
+
+ found = rt_delete(radixtree, key);
+
+ if (!found)
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, key);
+ }
+
+ num_entries = rt_num_entries(radixtree);
+
+ /* The tree must be empty */
+ if (num_entries != 0)
+ elog(ERROR,
+ "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT,
+ num_entries, UINT64CONST(256));
+}
+
+/*
+ * Test inserting and deleting key-value pairs for each node type at the given
+ * shift level.
+ */
+static void
+test_node_types(uint8 shift)
+{
+ rt_radix_tree *radixtree;
+
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree node types with shift \"%d\"", shift);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
+#else
+ radixtree = rt_create(CurrentMemoryContext);
+#endif
+
+ /*
+ * Insert and search entries for every node type at the 'shift' level,
+ * then delete all entries to make it empty, and insert and search entries
+ * again.
+ */
+ test_node_types_insert(radixtree, shift, true);
+ test_node_types_delete(radixtree, shift);
+ test_node_types_insert(radixtree, shift, false);
+
+ rt_free(radixtree);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec * spec)
+{
+ rt_radix_tree *radixtree;
+ rt_iter *iter;
+ MemoryContext radixtree_ctx;
+ TimestampTz starttime;
+ TimestampTz endtime;
+ uint64 n;
+ uint64 last_int;
+ uint64 ndeleted;
+ uint64 nbefore;
+ uint64 nafter;
+ int patternlen;
+ uint64 *pattern_values;
+ uint64 pattern_num_values;
+#ifdef RT_SHMEM
+ int tranche_id = LWLockNewTrancheId();
+ dsa_area *dsa;
+
+ LWLockRegisterTranche(tranche_id, "test_radix_tree");
+ dsa = dsa_create(tranche_id);
+#endif
+
+ elog(NOTICE, "testing radix tree with pattern \"%s\"", spec->test_name);
+ if (rt_test_stats)
+ fprintf(stderr, "-----\ntesting radix tree with pattern \"%s\"\n", spec->test_name);
+
+ /* Pre-process the pattern, creating an array of integers from it. */
+ patternlen = strlen(spec->pattern_str);
+ pattern_values = palloc(patternlen * sizeof(uint64));
+ pattern_num_values = 0;
+ for (int i = 0; i < patternlen; i++)
+ {
+ if (spec->pattern_str[i] == '1')
+ pattern_values[pattern_num_values++] = i;
+ }
+
+ /*
+ * Allocate the radix tree.
+ *
+ * Allocate it in a separate memory context, so that we can print its
+ * memory usage easily.
+ */
+ radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
+ "radixtree test",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextSetIdentifier(radixtree_ctx, spec->test_name);
+
+#ifdef RT_SHMEM
+ radixtree = rt_create(radixtree_ctx, dsa, tranche_id);
+#else
+ radixtree = rt_create(radixtree_ctx);
+#endif
+
+
+ /*
+ * Add values to the set.
+ */
+ starttime = GetCurrentTimestamp();
+
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ uint64 x = 0;
+
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ bool found;
+
+ x = last_int + pattern_values[i];
+
+ found = rt_set(radixtree, x, (TestValueType*) &x);
+
+ if (found)
+ elog(ERROR, "newly inserted key 0x" UINT64_HEX_FORMAT " found", x);
+
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+
+ endtime = GetCurrentTimestamp();
+
+ if (rt_test_stats)
+ fprintf(stderr, "added " UINT64_FORMAT " values in %d ms\n",
+ spec->num_values, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Print stats on the amount of memory used.
+ *
+ * We print the usage reported by rt_memory_usage(), as well as the stats
+ * from the memory context. They should be in the same ballpark, but it's
+ * hard to automate testing that, so if you're making changes to the
+ * implementation, just observe that manually.
+ */
+ if (rt_test_stats)
+ {
+ uint64 mem_usage;
+
+ /*
+ * Also print memory usage as reported by rt_memory_usage(). It
+ * should be in the same ballpark as the usage reported by
+ * MemoryContextStats().
+ */
+ mem_usage = rt_memory_usage(radixtree);
+ fprintf(stderr, "rt_memory_usage() reported " UINT64_FORMAT " (%0.2f bytes / integer)\n",
+ mem_usage, (double) mem_usage / spec->num_values);
+
+ MemoryContextStats(radixtree_ctx);
+ }
+
+ /* Check that rt_num_entries works */
+ n = rt_num_entries(radixtree);
+ if (n != spec->num_values)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT, n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_search()
+ */
+ starttime = GetCurrentTimestamp();
+
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ bool expected;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Do we expect this value to be present in the set? */
+ if (x >= last_int)
+ expected = false;
+ else
+ {
+ uint64 idx = x % spec->spacing;
+
+ if (idx >= patternlen)
+ expected = false;
+ else if (spec->pattern_str[idx] == '1')
+ expected = true;
+ else
+ expected = false;
+ }
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (found != expected)
+ elog(ERROR, "mismatch at 0x" UINT64_HEX_FORMAT ": %d vs %d", x, found, expected);
+ if (found && (v != (TestValueType) x))
+ elog(ERROR, "found 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT,
+ v, x);
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "probed " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ /*
+ * Test iterator
+ */
+ starttime = GetCurrentTimestamp();
+
+ iter = rt_begin_iterate(radixtree);
+ n = 0;
+ last_int = 0;
+ while (n < spec->num_values)
+ {
+ for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+ {
+ uint64 expected = last_int + pattern_values[i];
+ uint64 x;
+ TestValueType val;
+
+ if (!rt_iterate_next(iter, &x, &val))
+ break;
+
+ if (x != expected)
+ elog(ERROR,
+ "iterate returned wrong key; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d",
+ x, expected, i);
+ if (val != (TestValueType) expected)
+ elog(ERROR,
+ "iterate returned wrong value; got 0x" UINT64_HEX_FORMAT ", expected 0x" UINT64_HEX_FORMAT " at %d", x, expected, i);
+ n++;
+ }
+ last_int += spec->spacing;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "iterated " UINT64_FORMAT " values in %d ms\n",
+ n, (int) (endtime - starttime) / 1000);
+
+ rt_end_iterate(iter);
+
+ if (n < spec->num_values)
+ elog(ERROR, "iterator stopped short after " UINT64_FORMAT " entries, expected " UINT64_FORMAT, n, spec->num_values);
+ if (n > spec->num_values)
+ elog(ERROR, "iterator returned " UINT64_FORMAT " entries, " UINT64_FORMAT " was expected", n, spec->num_values);
+
+ /*
+ * Test random-access probes with rt_delete()
+ */
+ starttime = GetCurrentTimestamp();
+
+ nbefore = rt_num_entries(radixtree);
+ ndeleted = 0;
+ for (n = 0; n < 100000; n++)
+ {
+ bool found;
+ uint64 x;
+ TestValueType v;
+
+ /*
+ * Pick next value to probe at random. We limit the probes to the
+ * last integer that we added to the set, plus an arbitrary constant
+ * (1000). There's no point in probing the whole 0 - 2^64 range, if
+ * only a small part of the integer space is used. We would very
+ * rarely hit values that are actually in the set.
+ */
+ x = pg_prng_uint64_range(&pg_global_prng_state, 0, last_int + 1000);
+
+ /* Is it present according to rt_search() ? */
+ found = rt_search(radixtree, x, &v);
+
+ if (!found)
+ continue;
+
+ /* If the key is found, delete it and check again */
+ if (!rt_delete(radixtree, x))
+ elog(ERROR, "could not delete key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_search(radixtree, x, &v))
+ elog(ERROR, "found deleted key 0x" UINT64_HEX_FORMAT, x);
+ if (rt_delete(radixtree, x))
+ elog(ERROR, "deleted already-deleted key 0x" UINT64_HEX_FORMAT, x);
+
+ ndeleted++;
+ }
+ endtime = GetCurrentTimestamp();
+ if (rt_test_stats)
+ fprintf(stderr, "deleted " UINT64_FORMAT " values in %d ms\n",
+ ndeleted, (int) (endtime - starttime) / 1000);
+
+ nafter = rt_num_entries(radixtree);
+
+ /* Check that rt_num_entries works */
+ if ((nbefore - ndeleted) != nafter)
+ elog(ERROR, "rt_num_entries returned " UINT64_FORMAT ", expected " UINT64_FORMAT "after " UINT64_FORMAT " deletion",
+ nafter, (nbefore - ndeleted), ndeleted);
+
+ rt_free(radixtree);
+ MemoryContextDelete(radixtree_ctx);
+#ifdef RT_SHMEM
+ dsa_detach(dsa);
+#endif
+}
+
+Datum
+test_radixtree(PG_FUNCTION_ARGS)
+{
+ test_empty();
+
+ for (int i = 1; i < lengthof(rt_node_kind_fanouts); i++)
+ {
+ test_basic(rt_node_kind_fanouts[i], false);
+ test_basic(rt_node_kind_fanouts[i], true);
+ }
+
+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);
+
+ /* Test different test patterns, with lots of entries */
+ for (int i = 0; i < lengthof(test_specs); i++)
+ test_pattern(&test_specs[i]);
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_radixtree/test_radixtree.control b/src/test/modules/test_radixtree/test_radixtree.control
new file mode 100644
index 0000000000..e53f2a3e0c
--- /dev/null
+++ b/src/test/modules/test_radixtree/test_radixtree.control
@@ -0,0 +1,4 @@
+comment = 'Test code for radix tree'
+default_version = '1.0'
+module_pathname = '$libdir/test_radixtree'
+relocatable = true
diff --git a/src/tools/pginclude/cpluspluscheck b/src/tools/pginclude/cpluspluscheck
index 4e09c4686b..202bf1c04e 100755
--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck
index 8dee1b5670..133313255c 100755
--- a/src/tools/pginclude/headerscheck
+++ b/src/tools/pginclude/headerscheck
@@ -96,6 +96,12 @@ do
test "$f" = src/include/nodes/nodetags.h && continue
test "$f" = src/backend/nodes/nodetags.h && continue
+ # radixtree_*_impl.h cannot be included standalone: they are just code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue
+
# These files are not meant to be included standalone, because
# they contain lists that might have multiple use-cases.
test "$f" = src/include/access/rmgrlist.h && continue
--
2.31.1
On Mon, Apr 17, 2023 at 8:49 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:
> - With lazy expansion and single-value leaves, the root of a radix tree
> can point to a single leaf. That might get rid of the need to track
> TBMStatus, since setting a single-leaf tree should be cheap.
>
> Instead of introducing single-value leaves to the radix tree as
> another structure, can we store pointers to PagetableEntry as values?
Well, that's pretty much what a single-value leaf is. Now that I've had
time to pause and regroup, I've looked into some aspects we previously put
off for future work, and this is one of them.
The concept is really quite trivial, and it's the simplest and most
flexible way to implement ART. Our, or at least my, documented reason not
to go that route was due to "an extra pointer traversal", but that's
partially mitigated by "lazy expansion", which is actually fairly easy to
do with single-value leaves. The two techniques complement each other in a
natural way. (Path compression, on the other hand, is much more complex.)
Note: I've moved the CF entry to the next CF, and set to waiting on
author for now. Since no action is currently required from Masahiko, I've
added myself as author as well. If tackling bitmap heap scan shows promise,
we could RWF and resurrect at a later time.
> Thanks. I'm going to continue researching the memory limitation and
Sounds like the best thing to nail down at this point.
> try lazy path expansion until PG17 development begins.
This doesn't seem like a useful thing to try and attach into the current
patch (if that's what you mean), as the current insert/delete paths are
quite complex. Using bitmap heap scan as a motivating use case, I hope to
refocus complexity to where it's most needed, and aggressively simplify
where possible.
--
John Naylor
EDB: http://www.enterprisedb.com
On Wed, Apr 19, 2023 at 4:02 PM John Naylor
<john.naylor@enterprisedb.com> wrote:
> On Mon, Apr 17, 2023 at 8:49 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> - With lazy expansion and single-value leaves, the root of a radix tree can point to a single leaf. That might get rid of the need to track TBMStatus, since setting a single-leaf tree should be cheap.
> Instead of introducing single-value leaves to the radix tree as
> another structure, can we store pointers to PagetableEntry as values?
>
> Well, that's pretty much what a single-value leaf is. Now that I've had time to pause and regroup, I've looked into some aspects we previously put off for future work, and this is one of them.
> The concept is really quite trivial, and it's the simplest and most flexible way to implement ART. Our, or at least my, documented reason not to go that route was due to "an extra pointer traversal", but that's partially mitigated by "lazy expansion", which is actually fairly easy to do with single-value leaves. The two techniques complement each other in a natural way. (Path compression, on the other hand, is much more complex.)
> Note: I've moved the CF entry to the next CF, and set to waiting on author for now. Since no action is currently required from Masahiko, I've added myself as author as well. If tackling bitmap heap scan shows promise, we could RWF and resurrect at a later time.
> Thanks. I'm going to continue researching the memory limitation and
> Sounds like the best thing to nail down at this point.
> try lazy path expansion until PG17 development begins.
> This doesn't seem like a useful thing to try and attach into the current patch (if that's what you mean), as the current insert/delete paths are quite complex. Using bitmap heap scan as a motivating use case, I hope to refocus complexity to where it's most needed, and aggressively simplify where possible.
I agree that we don't want to make the current patch any more complex.
Thinking about the memory limitation more, I think a combination of two
ideas works well: specifying the initial and maximum DSA segment sizes,
and dsa_set_size_limit(). There are two goals when the memory usage
reaches the limit: (1) minimize the size of the last allocated memory
block that has been allocated but not yet used, and (2) minimize the
amount of memory that exceeds the limit. Since we can specify the
maximum DSA segment size, the last block allocated before reaching the
memory limit is small. Also, thanks to dsa_set_size_limit(), the total
DSA size stops at the limit, so (memory_usage >= memory_limit) becomes
true without actually exceeding the limit.
Given that we need to configure the initial and maximum DSA segment
sizes and set the DSA size limit for TidStore memory accounting and
limiting, it would be better to create the DSA for the TidStore inside
the TidStoreCreate() API, rather than creating the DSA in the caller
and passing it to TidStoreCreate().
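
To sketch the shape I have in mind (the struct layout below is just a
placeholder, and deriving the initial/maximum segment sizes from the
budget is omitted since it needs the dsa.c changes discussed above;
dsa_create() and dsa_set_size_limit() are the only existing APIs used):

typedef struct TidStore
{
	dsa_area   *area;			/* owned by the TidStore */
	size_t		max_bytes;		/* memory budget */
	/* ... radix tree handle etc. ... */
} TidStore;

TidStore *
TidStoreCreate(size_t max_bytes, int tranche_id)
{
	TidStore   *ts = palloc0(sizeof(TidStore));

	ts->max_bytes = max_bytes;

	/* the DSA belongs to the TidStore, not the caller */
	ts->area = dsa_create(tranche_id);

	/*
	 * Cap the total DSA size at the budget so that the accounting check
	 * (memory usage >= limit) becomes true without overshooting.  Choosing
	 * small initial/maximum segment sizes relative to max_bytes (not shown)
	 * keeps the last, partially-used segment small.
	 */
	dsa_set_size_limit(ts->area, max_bytes);

	return ts;
}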
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Apr 7, 2023 at 4:55 PM John Naylor <john.naylor@enterprisedb.com>
wrote:
> - Fixed-size PagetableEntry's are pretty large, but the tid compression
> scheme used in this thread (in addition to being complex) is not a great
> fit for tidbitmap because it makes it more difficult to track per-block
> metadata (see also next point). With the "combined pointer-value slots"
> technique, if a page's max tid offset is 63 or less, the offsets can be
> stored directly in the pointer for the exact case. The lowest bit can tag
> to indicate a pointer to a single-value leaf. That would complicate
> operations like union/intersection and tracking "needs recheck", but it
> would reduce memory use and node-traversal in common cases.
[just getting some thoughts out there before I have something concrete]
Thinking some more, this needn't be complicated at all. We'd just need to
reserve some bits of a bitmapword for the tag, as well as flags for
"ischunk" and "recheck". The other bits can be used for offsets.
Getting/storing the offsets basically amounts to adjusting the shift by a
constant. That way, this "embeddable PTE" could serve as both "PTE embedded
in a node pointer" and also the first member of a full PTE. A full PTE is
now just an array of embedded PTEs, except only the first one has the flags
we need. That reduces the number of places that have to be different.
Storing any set of offsets all less than ~60 would save
allocation/traversal in a large number of real cases. Furthermore, that
would reduce a full PTE to 40 bytes because there would be no padding.
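
For what it's worth, here is a tiny standalone sketch of what such an
embedded PTE could look like; the bit layout and the names are invented
for illustration and are not meant to match tidbitmap.c:

#include <stdbool.h>
#include <stdint.h>

/*
 * On a 64-bit build a bitmapword-sized slot could carry:
 *   bit  0       tag: slot holds an embedded PTE, not a child pointer
 *   bit  1       "ischunk"
 *   bit  2       "recheck"
 *   bits 3..63   one bit per offset, for offsets 1..61
 */
typedef uint64_t slotword;

#define EMBED_TAG		((slotword) 1 << 0)
#define EMBED_ISCHUNK	((slotword) 1 << 1)
#define EMBED_RECHECK	((slotword) 1 << 2)
#define EMBED_SHIFT		3
#define EMBED_MAX_OFF	(64 - EMBED_SHIFT)	/* 61 */

/* can this offset live in the embedded form at all? */
static inline bool
embed_offset_fits(int off)
{
	return off >= 1 && off <= EMBED_MAX_OFF;
}

/* storing an offset is just a shift by a small constant */
static inline slotword
embed_add_offset(slotword w, int off)
{
	return w | EMBED_TAG | ((slotword) 1 << (off - 1 + EMBED_SHIFT));
}

/* likewise for membership tests */
static inline bool
embed_test_offset(slotword w, int off)
{
	return (w & ((slotword) 1 << (off - 1 + EMBED_SHIFT))) != 0;
}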
This all assumes the key (block number) is no longer stored in the PTE,
whether embedded or not. That would mean this technique:
> - With lazy expansion and single-value leaves, the root of a radix tree
> can point to a single leaf. That might get rid of the need to track
> TBMStatus, since setting a single-leaf tree should be cheap.
...is not a good trade off because it requires each leaf to have the key,
and would thus reduce the utility of embedded leaves. We just need to make
sure storing a single value is not costly, and I suspect it's not.
(Currently the overhead avoided is allocating and zeroing a few kilobytes
for a hash table). If it is not, then we don't need a special case in
tidbitmap, which would be a great simplification. If it is, there are other
ways to mitigate.
--
John Naylor
EDB: http://www.enterprisedb.com
I wrote:
> the current insert/delete paths are quite complex. Using bitmap heap scan
> as a motivating use case, I hope to refocus complexity to where it's most
> needed, and aggressively simplify where possible.
Sometime in the not-too-distant future, I will start a new thread focusing
on bitmap heap scan, but for now, I just want to share some progress on
making the radix tree usable not only for that, but hopefully a wider range
of applications, while making the code simpler and the binary smaller. The
attached patches are incomplete (e.g. no iteration) and quite a bit messy,
so tar'd and gzip'd for the curious (should apply on top of v32 0001-03 +
0007-09).
0001
This combines a few concepts that I didn't bother separating out after the
fact:
- Split insert_impl.h into multiple functions for improved readability and
maintainability.
- Use single-value leaves as the basis for storing values, with the goal to
get to "combined pointer-value slots" for efficiency and flexibility.
- With the latter in mind, searching the child within a node now returns
the address of the slot. This allows the same interface whether the slot
contains a child pointer or a value.
- Starting with RT_SET, start turning some iterative algorithms into
recursive ones. This is a more natural way to traverse a tree structure,
and we already see an advantage: Previously when growing a node, we
searched within the parent to update its reference to the new node, because
we didn't know the slot we descended from. Now we can simply update a
single variable.
- Since we recursively pass the "shift" down the stack, it doesn't have to
be stored in any node -- only the "top-level" start shift is stored in the
tree control struct. This was easy to code since the node's shift value was
hardly ever accessed anyway! The node header shrinks from 5 bytes to 4.
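
As a toy illustration of the recursion shape (this is not the patch's
RT_SET: nodes here always have fanout 256 and values are stored directly
at the last level), note how each level receives the address of the
parent's slot, so replacing a grown node would be a single assignment,
and how the shift lives only on the stack:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct toy_node
{
	struct toy_node *children[256];	/* used when shift > 0 */
	uint64_t	values[256];		/* used when shift == 0 */
	bool		isset[256];
} toy_node;

/* returns true if the key was already present */
static bool
toy_set(toy_node **slot, int shift, uint64_t key, uint64_t value)
{
	toy_node   *node = *slot;
	uint8_t		chunk = (key >> shift) & 0xFF;

	if (node == NULL)
		*slot = node = calloc(1, sizeof(toy_node));

	if (shift == 0)
	{
		bool		found = node->isset[chunk];

		node->isset[chunk] = true;
		node->values[chunk] = value;
		return found;
	}

	/* hand the child slot's address down to the next level */
	return toy_set(&node->children[chunk], shift - 8, key, value);
}

int
main(void)
{
	toy_node   *root = NULL;
	int			start_shift = 16;	/* kept in the tree control struct, not in each node */

	toy_set(&root, start_shift, 0x1234AB, 42);
	printf("re-insert found existing key: %d\n",
		   toy_set(&root, start_shift, 0x1234AB, 42));
	return 0;
}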
0002
Back in v15, we tried keeping DSA/local pointers as members of a struct. I
did not like the result, but still thought it was a good idea. RT_DELETE is
a complex function and I didn't want to try rewriting it without a pointer
abstraction, so I've resurrected this idea, but in a simpler, less
intrusive way. A key difference from v15 is using a union type for the
non-shmem case.
0004
Rewrite RT_DELETE using recursion. I find this simpler than the previous
open-coded stack.
0005-06
Deletion has an inefficiency: One function searches for the child to see if
it's there, then another function searches for it again to delete it. Since
0001, a successful child search returns the address of the slot, so we can
save it. For the two smaller "linear search" node kinds we can then use a
single subtraction to compute the chunk/slot index for deletion. Also,
split RT_NODE_DELETE_INNER into separate functions, for a similar reason as
the insert case in 0001.
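
To spell that out with made-up names: once the child search has handed
back the slot address, the index needed for deletion in a linear-search
node is a single pointer subtraction, and the rest is two memmoves:

#include <stdint.h>
#include <string.h>

/*
 * Delete the child at 'found', where 'found' was returned by the earlier
 * search into this node's 'slots' array and 'chunks' is the parallel array
 * of key chunks.
 */
static inline void
linear_node_delete(uint8_t *chunks, void **slots, int count, void **found)
{
	int			idx = (int) (found - slots);	/* single subtraction */

	memmove(&chunks[idx], &chunks[idx + 1], (count - idx - 1) * sizeof(uint8_t));
	memmove(&slots[idx], &slots[idx + 1], (count - idx - 1) * sizeof(void *));
}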
0007
Anticipate node shrinking: If only one node-kind needs to be freed, we can
move a branch to that one code path, rather than every place where RT_FREE
is inlined.
0009
Teach node256 how to shrink *. Since we know the number of children in a
node256 can't possibly be zero, we can use uint8 to store the count and
interpret an overflow to zero as 256 for this node. The node header shrinks
from 4 bytes to 3.
* Other nodes will follow in due time, but only after I figure out how to
do it nicely (ideas welcome!) -- currently node32's two size classes work
fine for growing, but the code should be simplified before extending to
other cases.
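
A two-line sketch of the count trick above (names invented): since an
existing node256 can never have zero children, a stored uint8 count of
zero can only mean the counter wrapped around at 256:

#include <stdint.h>

/* read back the logical child count of a non-empty node256 */
static inline int
node256_get_count(uint8_t raw_count)
{
	return (raw_count == 0) ? 256 : raw_count;
}

/* incrementing 255 wraps to 0, which node256_get_count() reads as 256 */
static inline uint8_t
node256_inc_count(uint8_t raw_count)
{
	return (uint8_t) (raw_count + 1);
}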
0010
Limited support for "combined pointer-value slots". At compile-time, choose
either that or "single-value leaves" based on the size of the value type
template parameter. Values that are pointer-sized or less can fit in the
last-level child slots of nominal "inner nodes" without duplicated
leaf-node code. Node256 now must act like the previous 'node256 leaf',
since zero is a valid value. Aside from that, this was a small change.
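
A minimal standalone sketch of that compile-time switch (the macro and
typedef names here are invented, not the template's): sizeof() is a
constant expression, so the compiler simply folds away the branch that
does not apply, and no duplicated leaf-node code is needed when the
value fits in a slot:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t RT_VALUE_TYPE;		/* example template parameter */

/* true when the value can live directly in a pointer-sized child slot */
#define RT_VALUE_EMBEDDABLE (sizeof(RT_VALUE_TYPE) <= sizeof(uintptr_t))

int
main(void)
{
	if (RT_VALUE_EMBEDDABLE)
		puts("values stored directly in last-level child slots");
	else
		puts("values stored in separately allocated single-value leaves");
	return 0;
}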
What I've shared here could work (in principle, since it uses uint64
values) for tidstore, possibly faster (untested) because of better code
density, but as mentioned I want to shoot for higher. For tidbitmap.c, I
want to extend this idea and branch at run-time on a per-value basis, so
that a page-table entry that fits in a pointer can go there, and if not,
it'll be a full leaf. (This technique enables more flexibility in
lossifying pages as well.) Run-time info will require e.g. an additional
bit per slot. Since the node header is now 3 bytes, we can spare one more
byte in the node3 case. In addition, we can and should also bump it back up
to node4, still keeping the metadata within 8 bytes (no struct padding).
I've started in this patchset to refer to the node kinds as "4/16/48/256",
regardless of their actual fanout. This is for readability (by matching the
language in the paper) and maintainability (should *not* ever change
again). The size classes (including multiple classes per kind) could be
determined by macros and #ifdef's. For example, in non-SIMD architectures,
it's likely slow to search an array of 32 key chunks, so in that case the
compiler should choose size classes similar to these four nominal kinds.
--
John Naylor
EDB: http://www.enterprisedb.com
Attachments:
v33-ART.tar.gz (application/gzip)